🧠 NEXUS EVOLUTION PROOF · DEEP-DIVE DEBRIEF
Why the Dashboard Says 🔴 CRITICAL and the Plumbers Don't
April 17, 2026 · 9:55 PM Central Time · For: Robert Dove · Status: Diagnosis complete, no code changed
🎯 The 60-Second Story
- The dashboard's two 🔴 CRITICAL alerts are not a plumber problem. Techs didn't stop closing jobs.
- On Apr 12, we quarantined nexus_titan_migration.py because it caused the Apr 3 phantom $6.4M fire. That was the right move.
- But that script had four jobs: (1) insert ST jobs, (2) backfill invoice totals, (3) populate job_number, (4) write the st_jobs_cache.json file. Only job #1 was reassigned (to the 15-min daemon). Jobs #2 through #4 became orphans.
- The two alerts are watching two of those orphans: the invoice backfill and the cache file. The monitors don't know the script was retired.
- But under the monitors sits a chronic real gap: invoice_total is populated on only 0% to 55% of each day's completed jobs across all of April, never higher. That is a real measurement problem, and it's why ST shows about $13.7K/wk while Big Sale shows about $226K/wk.
- Two separate problems, one loud and cosmetic, one quiet and architectural. The quiet one is the one that matters.
① Patient Vitals · what's actually running vs. what's not
🟢 Actually Alive
| Component | Last heartbeat |
| titan_sync_daemon.py | 1 min ago ✅ (every 15m) |
| titan_invoice_sync.py | 05:15 CT today ✅ |
| titan-killer.service (API) | active ✅ |
| zeus modules (6 of 7) | all 🟢 |
| ST API auth | pulled 169 invoices today ✅ |
| Postgres INSERTs into titan.jobs | last 1h 33m ago ✅ |
🔴 What the Monitors Yell About
| Alert | Why it fires |
| Anomaly zero_invoice 75% | 6 of 8 recent jobs have no invoice_total |
| Data freshness: ST Jobs (6d) | st_jobs_cache.json mtime is Apr 12 |
Both are downstream of the same Apr 12 decision. Neither reflects a tech problem.
② The Pipeline Map · live flow of one ST job from creation to dashboard
┌─────────────────────────────────────────────────────────────────────┐
│ Customer calls · BSP schedules · Tech arrives · Job gets done       │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                       Writes into ServiceTitan
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ ServiceTitan API (source of truth for operational data)             │
│ /jobs   /invoices   /customers   /estimates                         │
└─────────┬──────────────────────┬────────────────────────┬───────────┘
          │                      │                        │
    every 15 min          05:15 CT daily           RETIRED Apr 12
          │                      │                        │
          ▼                      ▼                        ▼
┌───────────────────┐ ┌─────────────────────┐ ┌───────────────────────┐
│ titan_sync_daemon │ │ titan_invoice_sync  │ │ nexus_titan_migration │
│                   │ │                     │ │ (quarantined, was     │
│ INSERT new jobs   │ │ UPDATE invoice_     │ │ the $6.4M phantom)    │
│ INSERT customers  │ │ total WHERE st_id   │ │                       │
│ INSERT estimates  │ │ matches             │ │ Used to also write:   │
│                   │ │                     │ │ • invoice backfill    │
│ (no job_number,   │ │ 169 invoices/day    │ │ • job_number sync     │
│  no invoice_total,│ │ 86–115 updates/day  │ │ • st_jobs_cache.json  │
│  no scheduled_at) │ │                     │ │                       │
└─────────┬─────────┘ └──────────┬──────────┘ └───────────┬───────────┘
          │                      │                        │
          └──────────────────────┼────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Postgres · bsp_analytics · titan.jobs (the live table)              │
│ 11,831 jobs · last insert 1h 33m ago                                │
└────────────┬──────────────────────────────────────────┬─────────────┘
             │                                          │
             ▼                                          ▼
┌────────────────────────┐                ┌───────────────────────────┐
│ Dashboards + APIs      │                │ Anomaly detector          │
│ (HCP, Stephanie,       │                │ reads zero_invoice        │
│  Big Sale, Audrey…)    │                │ on 8 recent jobs          │
│                        │                │ 🔴 fires at 75%           │
│ SHOWS: broken ST rev   │                │                           │
│ $13.7K/wk              │                │ Session enforcer          │
│                        │                │ checks st_jobs_cache      │
│ Real revenue lives     │                │ mtime · 6d old            │
│ in Big Sale $226K/wk   │                │ 🔴 STALE                  │
└────────────────────────┘                └───────────────────────────┘
③ The Chronic Gap · invoice_total populated rate, by day
Every row below counts completed jobs only. If tech work flows into invoices and invoices flow into invoice_total, the populated rate should sit near 100%. It does not, and it has not all month.
| Day | Completed | With $ | Populated % |
| Apr 18 (partial) | 1 | 0 | 0% |
| Apr 17 | 10 | 0 | 0% |
| Apr 16 | 10 | 0 | 0% |
| Apr 15 | 11 | 6 | 55% |
| Apr 14 | 19 | 2 | 11% |
| Apr 13 | 9 | 1 | 11% |
| Apr 10 | 9 | 3 | 33% |
| Apr 08 | 11 | 1 | 9% |
| Apr 04 | 4 | 2 | 50% |
| Apr 02 | 6 | 3 | 50% |
Average across April: ~26% of completed jobs carry a non-zero invoice_total. Peak day: 55%. That means that for roughly 74% of completed jobs, our own ST mirror does not know what the job was worth.
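The per-day percentages in the table are simply "With $" divided by "Completed". A minimal Python sketch of that arithmetic, using the table rows as literal data; the name `populated_pct` is hypothetical and not part of the Nexus codebase:

```python
# Rows copied from the table above:
# (day, completed jobs, completed jobs with a non-zero invoice_total)
ROWS = [
    ("Apr 18", 1, 0), ("Apr 17", 10, 0), ("Apr 16", 10, 0),
    ("Apr 15", 11, 6), ("Apr 14", 19, 2), ("Apr 13", 9, 1),
    ("Apr 10", 9, 3), ("Apr 08", 11, 1), ("Apr 04", 4, 2),
    ("Apr 02", 6, 3),
]


def populated_pct(completed: int, with_dollars: int) -> int:
    """Share of completed jobs carrying a non-zero invoice_total, as a whole percent."""
    if completed == 0:
        return 0  # no completed jobs that day, so nothing to measure
    return round(100 * with_dollars / completed)


for day, completed, with_dollars in ROWS:
    print(f"{day}: {populated_pct(completed, with_dollars)}%")

# Best day among the rows shown.
peak = max(populated_pct(c, w) for _, c, w in ROWS)
print(f"Peak day: {peak}%")
```

Running the same function against a live query result instead of the literal rows would keep this table current without hand-copying numbers.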
④ The Causal Chain · Apr 3 fire · Apr 12 treaty · Apr 17 alert
Apr 3 04:07 UTC
▼
🔥 PHANTOM $6.4M discovered
nexus_titan_migration.py:249 INSERT missing created_at
10,461 jobs stamped with the same timestamp · scheduled_at spans 5 years
Apr 3 to Apr 12
▼
🧯 Incident response · Evolution Protocols v1 published
29-file blast radius documented in BSP_Data_Trust_Evolution_v1.html
Apr 12 (The Nexus Treaty)
▼
- nexus_titan_migration.py → one_time_migrations/ + chmod -x
- Postgres trigger guard added to titan.jobs (prevents bulk INSERT)
- titan_sync_daemon.py takes over job INSERTs (every 15 min)
11,729 phantom rows quarantined · 292 → 128 timers
BUT three responsibilities were never reassigned:
- Invoice total backfill on older jobs
- job_number population
- st_jobs_cache.json daily write
Apr 12 to Apr 17 (5 days)
▼
🕳️ Orphaned work piles up
Each day's completed jobs enter titan.jobs as skeletons and stay that way
Apr 17 21:42 CT
▼
🚨 Evolution Proof fires 🔴🔴
zero_invoice CRITICAL: 6 of 8 (75%) is over the 70% threshold
ST Jobs 6d stale: cache file mtime is Apr 12 00:00
⑤ The Math · why "75%" is both technically correct and statistically loud
Small-sample noise check
Sample: 8 jobs · 6 zero-invoice · point estimate 75%. Wilson 95% confidence interval: 40.9% to 92.9%. On 8 data points, the true rate could be as low as 41% or as high as 93%. The 70% threshold sits inside that interval, so this sample alone cannot show that the true rate is above the bar.
| Metric | Value | Note |
| Sample | n = 8 | jobs scheduled ≤ 7d |
| Zero-invoice | 6 | invoice_total is null or 0 |
| Point rate | 75% | fires at ≥ 70% |
| 95% CI | 41 to 93% | Wilson score |
But the signal is real at larger N
Widen to all completed jobs Apr 1 through Apr 18: 124 jobs total, 100 of them zero-invoice. That's an 80.6% zero rate on n=124. Wilson 95% CI: 73% to 87%. Statistically robust. The detector picked the wrong window, but the underlying finding is real.
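The two intervals quoted in this section can be checked with the standard Wilson score formula. A minimal Python sketch, assuming z = 1.96 for the 95% level; the function name `wilson_interval` is illustrative and not taken from the detector's code:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    # Centre is pulled toward 0.5, which is what protects small samples.
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo8, hi8 = wilson_interval(6, 8)          # the detector's 8-job window
lo124, hi124 = wilson_interval(100, 124)  # all completed jobs, Apr 1 through Apr 18

print(f"n=8:   {lo8:.1%} to {hi8:.1%}")
print(f"n=124: {lo124:.1%} to {hi124:.1%}")
print("70% threshold inside n=8 interval:  ", lo8 < 0.70 < hi8)
print("70% threshold below n=124 interval: ", lo124 > 0.70)
```

Both results agree with the figures above to within rounding. The comparison is the point: at n=8 the 70% bar falls inside the interval, while at n=124 the whole interval sits above it.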
Revenue implication
Evolution Proof reports $13,745/wk from ST and $226,703/wk from Big Sale. Ratio: 6.1%. If Big Sale is truth, ST is capturing only about 6% of real revenue. The roughly 80% zero-invoice rate on our ST mirror and the 6% revenue capture ratio are the same story told two different ways.
⑥ Monitor vs Reality · lane diagram
| Alert the dashboard shows | What it's measuring | What it means in reality |
| 🔴 CRIT zero_invoice 75% | Last 8 jobs with scheduled_at in 7d | Technically correct. Label ("techs not closing jobs") is wrong. Real cause: invoice sync doesn't backfill older jobs. |
| 🔴 STALE ST Jobs 6d | mtime of st_jobs_cache.json | File was written by the quarantined migration script. DB itself is fresh. Monitor watches a ghost. |
| 💪 56/100 Muscle score | How much Nexus is acting on data | Downstream of the first two. If ST mirror is 74% empty, action engines have thin signal. |
| 🟢 6 of 7 data fresh | ad_throttler, 3cx, st_enforce, ai_intake, anomaly_log, ads_audit | Correct. All six write-every-day files are under 6h old. |
⑦ The Fix Menu · ten levers, ranked by impact × effort
| # | Lever | Impact | Effort | Risk |
| A1 | Write a small job that has titan_sync_daemon.py also emit st_jobs_cache.json on each cycle | Silences ST Jobs stale alert permanently | ~30 min | 🟢 Low |
| A2 | Drop st_jobs_cache.json from DATA_SOURCES in nexus_session_enforcer_v2.py and replace with a live DB freshness query | Silences the alert AND makes the freshness check accurate | ~20 min | 🟢 Low |
| B1 | Raise check_zero_invoice_rate() minimum sample from any to n ≥ 20 | Stops small-N false alarms without hiding real gap | ~10 min | 🟢 Low |
| B2 | Also exclude jobs completed in last 24h (give invoice sync a chance) | Cuts the residual daily lag noise | ~10 min | 🟢 Low |
| C1 | Widen invoice sync window from 7 days to 30 days | Should lift population rate substantially on older jobs | ~10 min + one re-run | 🟡 Medium |
| C2 | Change invoice sync key from modifiedOnOrAfter to pull invoices for all open jobs in last 30d regardless of invoice modify date | Catches jobs whose invoice was created but never modified | ~1 hr | 🟡 Medium |
| C3 | Hook invoice sync to ST webhook invoice.updated so it is event-driven not cron | Near-realtime population · strongest fix | ~3 hr (webhook listener exists) | 🟡 Medium |
| D1 | Diagnostic: sample 10 recent zero-invoice jobs, call ST API directly, compare totals | Tells us how many are sync-miss vs. genuinely $0 (warranty / declined) | ~20 min | 🟢 Low |
| D2 | Remove the four cron lines that point into /purgatory/ and /backups/ (monday_sync, st_data_fixer, etc.) | Cleans noise; no functional change | ~10 min | 🟢 Low |
| E1 | Rename the zero_invoice alert label from "Techs not closing jobs" to "ST mirror invoice coverage gap" | Stops misleading anyone who glances at the dashboard | ~2 min | 🟢 Low |
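Levers B1 and B2 amount to two guards in front of the rate calculation. The sketch below shows one way they could look in Python. It is not the existing check_zero_invoice_rate(): the `Job` shape, the function names, and the constants are assumptions made for illustration, with the 70% threshold, n ≥ 20, and 24h values taken from this debrief.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MIN_SAMPLE = 20              # B1: below this, report "not enough data" instead of a rate
GRACE = timedelta(hours=24)  # B2: give the daily invoice sync one cycle to catch up
THRESHOLD = 0.70             # alert threshold quoted in this debrief


@dataclass
class Job:  # hypothetical row shape, not the real titan.jobs schema
    completed_at: datetime
    invoice_total: Optional[float]


def zero_invoice_rate(jobs: list[Job], now: datetime) -> Optional[float]:
    """Zero-invoice rate over eligible jobs, or None when the sample is too small."""
    eligible = [j for j in jobs if now - j.completed_at >= GRACE]
    if len(eligible) < MIN_SAMPLE:
        return None
    # invoice_total of None or 0 both count as "zero invoice".
    zero = sum(1 for j in eligible if not j.invoice_total)
    return zero / len(eligible)


def should_alert(jobs: list[Job], now: datetime) -> bool:
    rate = zero_invoice_rate(jobs, now)
    return rate is not None and rate >= THRESHOLD
```

With these guards, the 8-job window that fired on Apr 17 would return None and stay quiet, while the 124-job April sample (100 zero-invoice) would still alert.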
⑧ My Recommendation · what to do Monday morning
🎯 Sequenced plan
- Today (quiet the noise): Do A2 + B1 + B2 + E1. Together they take under 45 minutes and do not touch the data pipeline. Dashboard goes 🟢 without pretending problems away.
- Diagnostic before fix (D1): Pick 10 zero-invoice jobs, pull their invoices from ST directly, and find out the split between "sync missed it" and "job truly has no invoice yet". This decides whether C1 or C3 is the right fix.
- The real fix (C1 first, C3 later): Widen the invoice sync window to 30 days as a one-line change. Watch the population rate for 48 hours. If it is still below 70%, promote to C3 (event-driven).
- Cleanup (D2): Retire the dead cron lines. These have been erroring daily since Apr 12 and add noise to every log file.
What I did not do: I changed no code on the VM. All findings are read-only. Saying "yes" to any of A through E means I make the change and you review it before I restart the service.
🧭 How to read the Evolution Proof from now on
When the proof says 🔴 CRITICAL, ask three questions in this order:
- Does the DB itself say the data is stale? (Query MAX(updated_at), not a cache file.)
- Is the sample size big enough to trust the percentage? (Under n=20, treat any rate as a hint not a verdict.)
- Is the ALERT LABEL telling you the cause, or just the symptom? Most of our labels describe symptoms.
The dashboard is a thermometer, not a diagnosis. It is very good at telling you something is off. It is not very good at telling you what.
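The first question can be reduced to a small check on the newest row in the live table. A minimal Python sketch, under stated assumptions: the query text follows the MAX(updated_at) hint above, the 6-hour limit borrows the "under 6h old" figure used for the other data sources, and `is_stale` is a hypothetical helper, not an existing Nexus function.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed query, based on the MAX(updated_at) hint; the column is not verified here.
FRESHNESS_SQL = "SELECT MAX(updated_at) FROM titan.jobs"

# Assumed limit, borrowed from the "under 6h old" figure for the other sources.
MAX_AGE = timedelta(hours=6)


def is_stale(latest_update: Optional[datetime], now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """True when the newest row is older than max_age, or when the table is empty."""
    if latest_update is None:
        return True
    return now - latest_update > max_age


now = datetime(2026, 4, 18, 2, 42, tzinfo=timezone.utc)  # Apr 17 21:42 CT

# The DB's own view: last insert 1h 33m ago, so not stale.
print(is_stale(now - timedelta(hours=1, minutes=33), now))

# The cache file's view: mtime of Apr 12, so stale.
print(is_stale(datetime(2026, 4, 12, tzinfo=timezone.utc), now))
```

In practice the timestamp would come from running FRESHNESS_SQL through whatever Postgres driver the enforcer already uses and reading the single value it returns.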