5 fixes for the Nexus VM — 3 of 5 COMPLETE — $2/month total
Fix 1 of 5: Move secrets to Secret Manager (IN PROGRESS)
Problem: 61 API secrets + 52 config values + 13 token files (126 credentials total) sitting in /opt/nexus/nexus/config/.env. ServiceTitan, Google Ads, Facebook, Slack, Vapi, QuickBooks, Plaid, Gmail, Cloudflare, and 25+ more. If the VM is compromised, every API key is instantly exposed. Every system goes dark. $6.86M of customer data at risk.
Solution: Secrets fetched at runtime, never stored on disk. Audit trail on every access. Auto-rotation for expiring tokens. If the VM is hacked, the attacker gets nothing.
💡 Why it matters: These 126 credentials control access to $6.86M in customer data, $70K/week in revenue tracking, Daniel AI (65 calls/week), Google Ads ($500/day budget), and every Slack notification Ashton relies on. One breach = every system goes dark simultaneously.
ETA: This week. Robert migrates all .env vars to Secret Manager and updates the Python scripts to use the google-cloud-secret-manager library.
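A minimal sketch of that runtime-fetch pattern, using the standard google-cloud-secret-manager client; the project ID and secret name are illustrative placeholders, not our real values:

```python
# Sketch: fetch a credential at runtime instead of reading it from .env.
# "nexus-prod" and "servicetitan-api-key" are illustrative placeholders.
from google.cloud import secretmanager

_client = secretmanager.SecretManagerServiceClient()

def get_secret(secret_id: str, project_id: str = "nexus-prod") -> str:
    """Return the latest version of a secret; every access is audit-logged."""
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = _client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Before: st_key = os.environ["SERVICETITAN_API_KEY"]   # read from .env on disk
# After:  st_key = get_secret("servicetitan-api-key")   # never touches disk
```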
Fix 2 of 5: Daily VM snapshots (IN PROGRESS)
Problem: 300+ Python files, a PostgreSQL database, 59 timers, and 126 credentials, all on one VM with no backups. On April 12, the auto-repair agent destroyed 19 files. Recovery took HOURS of manual rebuilding. With automated snapshots, it would have been a 5-minute rollback.
Solution: Automatic snapshot every day at 2 AM CT, keeping 7 days. One-click restore of the entire VM to any point in the last week. The Great Stabilization (Apr 12) would have been a 5-minute fix instead of hours.
💡 Why it matters: The VM is the brain of the entire operation. 300+ scripts, 65+ database tables, 379 experiments, every dashboard, every notification. If it goes down without a backup, Robert rebuilds for hours while Ashton gets zero lead notifications, Daniel AI goes silent, and Monday standup has no data.
ETA: Robert does this in the GCP Console (the VM's auth scope doesn't allow the CLI): Compute Engine → Snapshots → Create snapshot schedule. 2 minutes.
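For reference, the same schedule can be created programmatically from any machine whose credentials aren't scope-limited like the VM's. A sketch using the google-cloud-compute client; the project, region, and policy names are placeholders:

```python
# Sketch: daily 2 AM CT snapshot schedule with 7-day retention, via the
# Compute Engine API. Run from a workstation with full-scope credentials.
# "nexus-prod" and "us-central1" are illustrative placeholders.
from google.cloud import compute_v1

schedule = compute_v1.ResourcePolicy(
    name="nexus-daily-2am",
    snapshot_schedule_policy=compute_v1.ResourcePolicySnapshotSchedulePolicy(
        schedule=compute_v1.ResourcePolicySnapshotSchedulePolicySchedule(
            daily_schedule=compute_v1.ResourcePolicyDailyCycle(
                days_in_cycle=1,
                start_time="07:00",  # the API takes UTC; 07:00 UTC = 2 AM CDT
            )
        ),
        retention_policy=compute_v1.ResourcePolicySnapshotSchedulePolicyRetentionPolicy(
            max_retention_days=7  # keep 7 days, matching the plan above
        ),
    ),
)

client = compute_v1.ResourcePoliciesClient()
client.insert(
    project="nexus-prod", region="us-central1", resource_policy_resource=schedule
).result()  # the policy still has to be attached to the VM's boot disk afterward
```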
Fix 3 of 5: Full-stack monitoring (DONE Apr 17)
Problem: 379 experiments running on one VM, and only CPU usage was visible. Memory leaks and the disk filling up from a growing PostgreSQL database were invisible until a crash. The Paul Bertrand incident (Ashton missed 4 calls) could have been prevented if we had seen the system was under stress.
Solution: Google Ops Agent now monitors RAM, disk, and logs. The Nexus Health Worker checks RAM, disk, CPU load, PostgreSQL, the Titan API, and critical services every 15 minutes, with a Slack alert if anything crosses a threshold. Fixed Apr 17.
💡 Why it matters: 379 experiments + 59 timers + 385 API endpoints all share one VM. A single memory leak can cascade into API crashes → Ashton misses leads → customers lost. Now we see it coming and fix it BEFORE anyone notices.
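The health worker's internals aren't shown here, but its checks amount to roughly this sketch; the thresholds, Postgres DSN, and Slack webhook URL are placeholders, not the real config:

```python
# Sketch of the 15-minute health pass: RAM, disk, CPU load, PostgreSQL.
# Thresholds, the DSN, and the webhook URL are illustrative placeholders.
import shutil

import psutil    # pip install psutil
import psycopg2  # pip install psycopg2-binary
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder

def alert(msg: str) -> None:
    requests.post(SLACK_WEBHOOK, json={"text": f"🚨 {msg}"}, timeout=10)

def run_checks() -> None:
    ram = psutil.virtual_memory().percent
    if ram > 90:
        alert(f"RAM at {ram:.0f}%")
    total, used, _ = shutil.disk_usage("/")
    if used / total > 0.85:
        alert(f"Disk at {used / total:.0%} (growing PostgreSQL?)")
    if psutil.getloadavg()[0] > psutil.cpu_count():
        alert("CPU load exceeds core count")
    try:
        psycopg2.connect("dbname=nexus connect_timeout=5").close()
    except psycopg2.OperationalError as exc:
        alert(f"PostgreSQL unreachable: {exc}")

if __name__ == "__main__":
    run_checks()  # scheduled every 15 minutes (e.g. by a systemd timer)
```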
Fix 4 of 5: Static external IP (DONE Apr 17)
Problem: If the VM rebooted and got a new IP, every webhook (Slack, Daniel AI, ServiceTitan, Vapi) would break. Hours of rewiring each time. Separately, Google is building two billion-dollar data centers in the KC area, which means sub-2ms latency once a KC region opens.
Solution: The IP is now static and will not change on reboot. All webhooks, Slack integrations, and Daniel AI callbacks are safe. When Google opens a KC region, we migrate for 5x faster API calls. Verified Apr 17.
💡 Why it matters: Every webhook, every Slack bot, every Daniel AI callback points to this IP. A dynamic IP change on reboot = hours of rewiring 34 API connections. A static IP means the VM is always reachable at the same address. And once a KC data center opens, nexus_number_gate.py (the 12-source financial verification) runs its API calls 5x faster.
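A quick way to verify from the VM that the address survived a reboot, using the standard GCE metadata endpoint (EXPECTED_IP is a placeholder):

```python
# Sketch: confirm the external IP still matches the reserved static address.
# EXPECTED_IP is a placeholder (documentation range); the metadata path is
# the standard GCE endpoint for the first NIC's external IP.
import requests

EXPECTED_IP = "203.0.113.7"  # placeholder
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/network-interfaces/0/access-configs/0/external-ip"
)

current_ip = requests.get(
    METADATA_URL, headers={"Metadata-Flavor": "Google"}, timeout=5
).text

assert current_ip == EXPECTED_IP, f"IP drifted to {current_ip}: webhooks will break"
print(f"Static IP intact: {current_ip}")
```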
Fix 5 of 5: VM clock set to Central time (DONE Apr 17)
Problem: Paul Bertrand called at 7:05 AM CT, but the logs showed "12:05 PM" because the VM was on UTC. Every timestamp, every timer, every cron job was 5 hours off. Ashton couldn't trust the notification times, and every team member had to mentally convert UTC to Central.
Solution: Timezone set to America/Chicago (CDT, UTC-05:00). Zero-dollar alert adjusted to 6 AM CT. Weather engine to 6 AM/12 PM/6 PM CT. Service watchdog to 6 AM/6 PM CT. All future Daniel AI call logs show correct KC time. Fixed Apr 17.
💡 Why it matters: Paul Bertrand called 4 times at 7 AM CT. The notification showed "12:05 PM." Ashton didn't know if these were morning calls or afternoon calls. Every timer that said "fire at 11:00" was actually firing at 6 AM CT without anyone knowing. The entire team was operating on a clock that was 5 hours wrong.
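The bug in a few lines, using Python's stdlib zoneinfo (the date is illustrative):

```python
# Sketch: why "12:05 PM" in the logs was really a 7:05 AM Central call.
# The date is illustrative; zoneinfo is stdlib in Python 3.9+.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

utc_entry = datetime(2025, 4, 14, 12, 5, tzinfo=timezone.utc)  # what the log said
local = utc_entry.astimezone(ZoneInfo("America/Chicago"))      # what actually happened

print(utc_entry.strftime("%I:%M %p %Z"))  # 12:05 PM UTC  -> the misleading log line
print(local.strftime("%I:%M %p %Z"))      # 07:05 AM CDT  -> Paul's real call time
```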