GCP Architecture Reference
A field reference for operating the Bright Side Plumbing nexus-vm production stack on Google Cloud, plus an onboarding map for a new junior web developer joining the team. The doc treats Google Cloud as the operating environment, calls out every place the BSP stack actually touches GCP, and flags the gotchas that bite single-VM workloads. No em dashes anywhere, dark mode by default, and every section closes with a production checklist.
Table of contents
🎓 FOR NEW HIRE, How to read this doc
Welcome. The fastest path to productivity is: skim sections 1, 2, 3, 6, and 15. Sections 1 and 15 cover the actual VM you will SSH into. Section 2 explains how Google decides whether your account or service account is allowed to do something. Section 3 explains how traffic from a browser actually reaches the VM. Section 6 is how we know anything is broken. Section 15 ties it together for our specific stack. Everything else is reference, dive in when you need it. Cloud work here is mostly Python, with shell glue and the gcloud CLI; Go is the language Google itself uses to build the platform; TypeScript shows up at the edges (Cloudflare Workers, Next.js, Bricks builder). Lean Python first.
🏗️ 1. Compute Engine HIGH PRIORITY
Compute Engine (GCE) is Google Cloud's IaaS layer. Our entire Nexus operational stack runs on a single GCE VM named nexus-vm at external IP 34.55.179.122. Everything in this section is calibrated for single-VM operations. Multi-VM, MIG, and regional patterns are summarized so you can recognize them, not deeply rehearsed.
1.1 The mental model: VM lifecycle and where state lives
A GCE VM is the composition of three independent objects: an instance (CPU/RAM/network attachment), one or more persistent disks (block storage that survives the instance), and a project + zone binding that scopes everything else (firewall rules, IAM, billing). When you "stop" a VM you keep the disks, lose the running RAM, and stop paying for vCPU/RAM but keep paying for disks and reserved static IPs. When you "delete" the instance you can choose to keep or delete each attached disk. Snapshots are the durable backup unit, they live in GCS-backed regional or multi-regional storage and are independent of the disk.
States in Compute Engine: PROVISIONING → STAGING → RUNNING → STOPPING → TERMINATED. There is also SUSPENDING/SUSPENDED for the suspend-to-disk flow which preserves RAM contents on a separate disk. Source: cloud.google.com/compute/docs/instances/instance-life-cycle.
⚠️ Gotcha, "stopped" still costs money
A stopped VM costs $0 for compute but you still pay for: attached persistent disks, attached GPUs that are reserved, reserved static external IPs (a static IP unattached to a running VM costs ~$0.005/hr, around $3.65/mo per address), and any committed-use discounts you bought. The "I'll stop the VM over the weekend to save money" play only works if you also release unused static IPs and right-size the disks.
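The weekend-stop math is easy to sanity-check in a few lines. A minimal sketch, where the disk rate is an illustrative assumption (check the pricing page for your region) and the IP rate is the ~$0.005/hr figure above:

```python
# Sketch: what a "stopped" VM still costs per month.
# The disk rate below is an ASSUMED illustrative figure, not a current list price.
HOURS_PER_MONTH = 730

def stopped_vm_monthly_cost(disk_gb: float,
                            disk_rate_per_gb_month: float = 0.10,  # assumed pd-balanced-ish rate
                            static_ips: int = 1,
                            ip_rate_per_hour: float = 0.005) -> float:
    """Charges that survive a stop: attached disks + reserved static IPs."""
    disk_cost = disk_gb * disk_rate_per_gb_month
    ip_cost = static_ips * ip_rate_per_hour * HOURS_PER_MONTH
    return round(disk_cost + ip_cost, 2)

# A 100 GB disk plus one idle static IP is still a real monthly line item.
print(stopped_vm_monthly_cost(100))  # 13.65
```

The point of the sketch: the static IP alone is ~$3.65/mo, so a "stopped for the weekend" VM with an unreleased IP and an oversized disk quietly keeps billing.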
⚠️ Gotcha, regional vs zonal scoping
VMs and persistent disks are zonal. If us-central1-a goes down, your VM and its zonal PDs are inaccessible. Snapshots and images are stored regionally or multi-regionally, independent of any zone. Static external IPs are regional. Plan with this in mind: a snapshot can rebuild your VM in any zone, but a zonal disk cannot be attached across zones, you must clone via snapshot. Source: cloud.google.com/compute/docs/regions-zones.
⚠️ Gotcha, instance metadata survives stop/start, not delete
Custom instance metadata (startup-script, ssh-keys, user data) lives on the instance object, not the disk. If you delete and recreate the instance reusing the same disk, the metadata is gone. Capture metadata before destructive operations: gcloud compute instances describe nexus-vm --zone us-central1-a --format='value(metadata)'.
1.2 Machine types: families, sizing, and what to pick
Machine types are grouped into families by workload pattern. The family determines the CPU platform, memory ratio, network bandwidth, and pricing curve.
| Family | Series | Workload fit | vCPU range | Mem/vCPU (GB) | Notes |
|---|---|---|---|---|---|
| General purpose | E2 | Cheap, web/dev, low constant load | 2-32 | 0.5-8 | Shared-core (e2-micro/small/medium), CPU platform abstracted |
| General purpose | N2, N2D | Balanced, most production workloads | 2-128 | 0.5-8 | N2 = Intel, N2D = AMD EPYC |
| General purpose | N4 | 2024 GA, Granite Rapids, Hyperdisk-only | 2-80 | 2-4 | Replaces N2 for new builds, but Hyperdisk only |
| General purpose | C3, C3D | Consistent high throughput | 4-176 | 2-8 | Sapphire/Genoa, Titanium NIC, Hyperdisk |
| General purpose | C4, C4A | 2024-25 GA, Emerald Rapids / Axion (Arm) | 2-192 | 2-8 | C4A is Google Axion Arm CPU |
| Compute optimized | C2, C2D, H3 | HPC, gaming servers, single-thread heavy | 4-360 | 2-8 | H3 is HPC-tuned, no live migration |
| Memory optimized | M1, M2, M3, X4 | SAP HANA, in-memory DBs | 40-1920 | 14-30 | X4 = bare metal up to 32 TB RAM |
| Storage optimized | Z3 | Local NVMe-heavy, OLAP, search | 88-176 | 8 | Up to 36 TB Local SSD |
| Accelerator optimized | A2, A3, G2 | GPU/ML, video transcoding | 12-208 | varies | A2=A100, A3=H100/H200, G2=L4 |
For single-VM operations like nexus-vm, the practical universe is E2, N2/N2D, C3. E2 if you want maximum cost efficiency and your workload is bursty. N2 if you want predictable performance with broad disk type support. C3 if you need consistent high throughput, but be aware C3 forces you onto Hyperdisk Balanced, which has different pricing than PD-Balanced.
📝 Code, list machine types in our zone
gcloud compute machine-types list \
--filter="zone:us-central1-a AND name~'^(e2|n2)-'" \
--format="table(name,guestCpus,memoryMb,maximumPersistentDisksSizeGb)" \
--sort-by="guestCpus,memoryMb"
⚠️ Gotcha, E2 shared-core is fine until it isn't
e2-micro / e2-small / e2-medium share a physical core with other tenants and use a CPU credit bucket. If your workload sustains above the burst baseline (roughly 25% of one vCPU for e2-micro, 50% for e2-small, 100% for e2-medium), it gets throttled. For nexus-vm running Python automation that occasionally spikes to do bulk embeddings, an E2 shared-core is the wrong move. Use e2-standard-2 at minimum, or N2 for predictable scheduling.
1.3 Custom machine types and sustained/committed use discounts
N1, N2, N2D, and E2 families allow custom CPU/RAM ratios. You pay for vCPU and RAM independently. Useful when your workload wants a shape no preset offers, say 4 vCPU and 24 GB, which sits between n2-standard-4 (16 GB) and n2-highmem-4 (32 GB). The custom path lets you land on exactly the right shape without overprovisioning.
Two automatic discount programs apply with no opt-in needed:
- Sustained use discount (SUD), applied automatically each month, lowers the price per vCPU/RAM as the instance runs for more of the month. Up to 30% off list for N1; varies for N2 (different curve, applied as inferred discount). Source: cloud.google.com/compute/docs/sustained-use-discounts.
- Committed use discount (CUD), opt-in, you commit 1 or 3 years to a region for a vCPU/RAM amount. 37% off for 1 year, ~55% off for 3 year (resource-based CUDs). Spend-based CUDs are also available for some products. CUDs apply across instances in the region, not tied to a single VM. Source: cloud.google.com/docs/cuds.
💡 Insight, CUDs for a single VM are still worth it
Even for a single nexus-vm, a 1-year resource-based CUD on the exact vCPU/RAM count typically pays back inside ~7 months. The risk is being locked into a region and having to keep paying if you move clouds. Numbers in Section 12 (Cost).
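The ~7-month claim follows directly from the discount arithmetic. A minimal sketch, assuming the commonly quoted 37% discount for a 1-year resource-based CUD:

```python
# Sketch: break-even point for a 1-year resource-based CUD vs on-demand.
# The 37% discount is the commonly quoted figure; treat it as an assumption.
def cud_breakeven_months(discount: float = 0.37, term_months: int = 12) -> float:
    """Months of on-demand usage at which the commitment becomes cheaper.
    You pay (1 - discount) * term_months worth of on-demand no matter what,
    so the CUD wins once you would have run longer than that anyway."""
    return round((1 - discount) * term_months, 2)

print(cud_breakeven_months())  # 7.56
```

In other words: if nexus-vm runs more than ~7.6 months of the year (it runs 12), the commitment is strictly cheaper, which is where the "pays back inside ~7 months" figure comes from.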
1.4 Disk types: PD, Hyperdisk, Local SSD
| Disk | Backing | Max IOPS/disk | Max throughput | Capacity | Use |
|---|---|---|---|---|---|
| pd-standard | Spinning HDD | ~7,500 r / 15,000 w | ~1.2 GB/s | 10 GB-64 TB | Cheap archive volumes, batch jobs |
| pd-balanced | SSD, mid tier | 15,000-80,000 | 240-1,200 MB/s | 10 GB-64 TB | Default for most VMs, good cost/perf |
| pd-ssd | SSD, premium | 15,000-100,000 | 240-1,200 MB/s | 10 GB-64 TB | Latency-sensitive DB workloads |
| pd-extreme | SSD, provisioned IOPS | up to 120,000 | 2,200 MB/s | 500 GB-64 TB | Predictable extreme IOPS, high cost |
| hyperdisk-balanced | SSD, decoupled IOPS+capacity+throughput | up to 350,000 | 5,000 MB/s | 4 GB-64 TB | Required on N4/C3/C4, future default |
| hyperdisk-extreme | SSD | up to 500,000 | 10,000 MB/s | 64 GB-64 TB | SAP HANA, high-end DBs |
| hyperdisk-throughput | HDD-priced, throughput-tuned | low | up to 600 MB/s | 2 TB-32 TB | Big sequential reads, log archives |
| Local SSD | NVMe attached to host | up to 9M (aggregate) | tens of GB/s | 375 GB increments | Scratch/cache, ephemeral, lost on stop |
🔥 Recency, PD-Standard sunset path
Google has been steering customers off pd-standard. New machine families (N4, C4) do not allow it. For nexus-vm on N2 the option still exists, but pd-balanced is the default for new boot disks and it is rare for the cost difference to justify pd-standard. Source: cloud.google.com/compute/docs/disks.
⚠️ Gotcha, performance scales with disk size
For pd-balanced and pd-ssd, IOPS and throughput scale with capacity. A 100 GB pd-balanced disk caps at ~3,000 IOPS no matter what your VM is. If your DB feels slow, oversize the disk. The size-IOPS curve is documented at cloud.google.com/compute/docs/disks/performance.
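The size-IOPS relationship can be modeled as a baseline plus a per-GB increment, capped per disk. The constants below are illustrative assumptions for a pd-balanced-style disk; the authoritative curve is the performance page linked above.

```python
# Sketch: linear size-to-IOPS model for a PD. Constants are ASSUMED for
# illustration; see cloud.google.com/compute/docs/disks/performance for real numbers.
def pd_read_iops(size_gb: int,
                 baseline: int = 3_000,     # assumed per-instance baseline
                 iops_per_gb: int = 6,      # assumed per-GB increment
                 cap: int = 80_000) -> int:
    """Provisioned read IOPS grow linearly with capacity up to a per-disk cap."""
    return min(baseline + size_gb * iops_per_gb, cap)

# Oversizing the disk is often the only lever: same VM, more IOPS.
for size in (100, 500, 2000):
    print(size, pd_read_iops(size))
```

The takeaway matches the gotcha: a small disk starves a busy database no matter how big the VM is, and growing the disk is the fix.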
⚠️ Gotcha, Local SSD is volatile
Local SSD is physically attached to the host machine. If the VM is stopped, terminated, live-migrated, or the host fails, the data is gone. Use Local SSD only for: ephemeral cache, RAID arrays where the application replicates data elsewhere, and scratch space for batch jobs. Never put a primary database on Local SSD without external replication.
1.5 OS images, Shielded VMs, Confidential VMs
Google maintains a catalog of public images: Debian (default for many tutorials), Ubuntu LTS (12, 14, 16, 18, 20, 22, 24), Rocky Linux 8/9, RHEL 7/8/9, CentOS Stream, SUSE, Windows Server (2016, 2019, 2022, 2025). Each project also gets a private image catalog for custom images you build with Packer or gcloud compute images create. The nexus-vm currently runs Debian (verify with cat /etc/os-release).
📝 Code, list latest Ubuntu 24.04 LTS images
gcloud compute images list \
--project=ubuntu-os-cloud \
--filter="family:ubuntu-2404-lts AND status=READY" \
--sort-by=~creationTimestamp \
--limit=3
Shielded VM adds three layers: secure boot (UEFI verifies signed firmware), virtual TPM (vTPM for measured boot, attestation), and integrity monitoring (each boot is checksummed and the dashboard shows drift). On by default for newer Google-published images. Costs nothing extra. Source: cloud.google.com/security/shielded-cloud/shielded-vm.
Confidential VM goes further: memory is encrypted in use using AMD SEV (N2D, C2D), AMD SEV-SNP (C3D), Intel TDX (C3), or NVIDIA H100 GPU memory protection (A3). Adds ~5-10% perf overhead on most workloads. Required for processing strongly regulated data on shared infrastructure. Source: cloud.google.com/confidential-computing.
⚠️ Gotcha, custom images and the kernel surprise
If you build a custom image from a Debian VM and apply it to a new VM, you may inherit a kernel pinned to the source machine type. When you create the new VM with a different machine family, the guest tools may fail to detect the new NIC or NVMe driver. The fix is to install google-osconfig-agent, google-cloud-sdk, and the google-compute-engine guest environment package before imaging. Or use gcloud compute images import which automates the conversion.
1.6 Metadata service: 169.254.169.254
Every GCE VM has a magic link-local IP 169.254.169.254 that serves project and instance metadata over HTTP. This is the same pattern as AWS EC2 IMDS but with key differences. Google's metadata service requires the header Metadata-Flavor: Google on every request, which prevents accidental exposure if a web server proxies untrusted user input.
📝 Code, metadata service queries from inside the VM
# Project ID
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/project/project-id
# Default service account email
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# Default SA OAuth access token (auto-refreshed by Google)
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
# SSH keys configured on the project (NOT used if OS Login is on)
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/project/attributes/ssh-keys
# Instance ID, zone, machine type
curl -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true&alt=json" | jq
⚠️ Gotcha, SSRF and the metadata service
If your application proxies arbitrary URLs and runs on a GCE VM, you have an SSRF vector to 169.254.169.254. The Metadata-Flavor: Google header was once required only on the v1 endpoints; it is now mandatory on every request, which blocks naive SSRF. Defense in depth still applies: lock down the egress URL allowlist or block link-local IPs at the application layer.
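An application-layer guard is a few lines of standard library. A minimal sketch of the "block link-local IPs" approach; the blocked-host names and the decision to also refuse loopback and private ranges are my assumptions, tighten or loosen them for your own egress policy:

```python
# Sketch: refuse user-supplied URLs that point at the metadata service
# or other link-local / internal targets, before the app fetches them.
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_HOSTS = {"metadata.google.internal", "metadata"}  # assumed denylist

def is_safe_url(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable: refuse rather than guess
    # Also refusing loopback and RFC 1918 here is a policy choice, not a requirement.
    return not (addr.is_link_local or addr.is_loopback or addr.is_private)

print(is_safe_url("http://169.254.169.254/computeMetadata/v1/"))  # False
```

Note this checks the name at validation time; a hardened version would pin the resolved IP for the actual fetch to defeat DNS rebinding.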
1.7 Startup scripts and shutdown scripts
Two metadata keys give you boot- and shutdown-time hooks: startup-script (or startup-script-url pointing to GCS) and shutdown-script. Startup scripts run as root every time the VM boots. Shutdown scripts run on a graceful stop with a 90-second timeout, after which the VM is force-stopped. Output goes to the serial console (gcloud compute instances get-serial-port-output nexus-vm) and to journalctl -u google-startup-scripts.service.
📝 Code, set a startup script that installs Ops Agent
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a \
--metadata=startup-script='#!/bin/bash
set -euo pipefail
if ! systemctl is-active --quiet google-cloud-ops-agent; then
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install
fi
systemctl enable google-cloud-ops-agent
systemctl start google-cloud-ops-agent'
⚠️ Gotcha, startup scripts run on every boot
Including when you stop and start the VM, including after a live migration in some cases. Make every startup script idempotent. The pattern systemctl is-active --quiet X || install_X is your friend. Do not put one-time bootstrap (project creation, DB init) in a startup script unless you guard it with a sentinel file.
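The sentinel-file guard generalizes beyond shell. A minimal sketch of the same pattern in Python, for one-time bootstrap steps driven from automation; the sentinel directory path and step names are illustrative:

```python
# Sketch: run a bootstrap step at most once, guarded by a sentinel file.
# SENTINEL_DIR is an assumed location, pick whatever survives reboots.
from pathlib import Path

SENTINEL_DIR = Path("/var/lib/nexus-bootstrap")

def run_once(step: str, action, sentinel_dir: Path = SENTINEL_DIR) -> bool:
    """Run `action` only if this step's sentinel does not exist yet.
    Returns True if the action ran, False if it was skipped."""
    sentinel_dir.mkdir(parents=True, exist_ok=True)
    marker = sentinel_dir / f"{step}.done"
    if marker.exists():
        return False
    action()            # if this raises, no marker is written, so it retries next boot
    marker.touch()      # mark done only after the action succeeded
    return True
```

Touching the marker after the action (not before) is the important detail: a failed bootstrap retries on the next boot instead of being silently marked complete.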
1.8 Live migration, spot, preemptible
By default, Google migrates your VM live to another host for maintenance with no downtime (typically <1 second pause). The default availability policy onHostMaintenance=MIGRATE is correct for production. The other option is TERMINATE, which is used for GPU/TPU machines and Spot VMs.
Spot VMs (the modern name) and Preemptible VMs (the legacy name, capped at 24 hours) are deeply discounted (60-91% off) compute that Google can reclaim with 30 seconds notice. Use for: stateless batch, fault-tolerant queues, CI runners. Do not use for: a single VM hosting your only production stack. Source: cloud.google.com/compute/docs/instances/spot.
🔥 Recency, preemptible VMs are deprecated for new use
The 24-hour-capped legacy preemptible VMs are still functional but Google steers everyone to Spot VMs (no time cap, more flexible reclaim contract). New automation should set --provisioning-model=SPOT and not --preemptible. Source: cloud.google.com/compute/docs/instances/preemptible.
1.9 SSH access methods (HIGH PRIORITY)
This is the section you will reread most. Three independent ways to SSH to a GCE VM, with very different security postures.
1.9.1 Method A, OS Login
Recommended default. SSH keys are tied to your Google identity (the email you log into the Cloud Console with), enforced via the roles/compute.osLogin or roles/compute.osAdminLogin IAM roles. Keys are pushed to the VM by the OS Login agent on boot. Revoking access is instant: remove the IAM binding and the key is removed on next sync.
📝 Code, enable OS Login at the project level
# Project-wide
gcloud compute project-info add-metadata \
--metadata enable-oslogin=TRUE
# Per-instance override
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a \
--metadata enable-oslogin=TRUE
# Grant SSH access (regular user)
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/compute.osLogin"
# Grant sudo access
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/compute.osAdminLogin"
1.9.2 Method B, Metadata SSH keys (legacy)
The traditional path. Each VM (or the project) carries an ssh-keys metadata entry containing public keys with a username:ssh-rsa AAAA... format. The Google Compute Engine guest agent picks these up and writes them into /home/<user>/.ssh/authorized_keys on each boot. This is what ~/.ssh/google_compute_engine is wired up to.
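When auditing that metadata entry, it helps to parse it instead of eyeballing it. A minimal sketch of a parser for the username:key format described above; the sample keys are made up:

```python
# Sketch: parse an ssh-keys metadata blob (one "username:key-type key [comment]"
# entry per line) into a username -> keys mapping. Sample data is fabricated.
def parse_ssh_keys(metadata_value: str) -> dict:
    keys: dict = {}
    for line in metadata_value.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue                      # skip blanks and malformed entries
        user, pubkey = line.split(":", 1)
        keys.setdefault(user, []).append(pubkey)
    return keys

sample = "dovew:ssh-rsa AAAATESTKEY robert\nci:ssh-ed25519 AAAACITESTKEY runner"
print(parse_ssh_keys(sample))
```

Feed it the output of the metadata curl from Section 1.6 and you get an instant answer to "whose keys are actually on this project," which is the monthly audit the checklist below asks for.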
📝 Code, the BSP standard SSH path
# From Robert's local machine
ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122
# Or via gcloud (auto-handles keys, OS Login if enabled, IAP otherwise)
gcloud compute ssh nexus-vm --zone=us-central1-a
# With command, no shell
gcloud compute ssh nexus-vm --zone=us-central1-a --command="uptime"
⚠️ Gotcha, OS Login and metadata keys conflict
If enable-oslogin=TRUE is set, the metadata ssh-keys entry is ignored. You can be locked out if you flip OS Login on without granting yourself the OS Login IAM role. Always grant the role first, verify SSH works, then enable OS Login.
1.9.3 Method C, IAP TCP forwarding
Identity-Aware Proxy TCP forwarding tunnels SSH (and other TCP) through Google's IAP fabric to a VM that has no public IP. The connection authenticates as your Google identity, and the VM's firewall only needs to allow port 22 from the IAP range 35.235.240.0/20. This is the path to a fully private VM that no one on the internet can reach. Source: cloud.google.com/iap/docs/using-tcp-forwarding.
📝 Code, SSH via IAP (no public IP needed)
# gcloud handles the tunnel automatically
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
# Required IAM
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/iap.tunnelResourceAccessor"
# Required firewall rule
gcloud compute firewall-rules create allow-ssh-from-iap \
--direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20
💡 Insight, the right answer for nexus-vm
Today, nexus-vm is reachable on a public IP 34.55.179.122 with metadata SSH keys. The bulletproof path is OS Login + IAP TCP forwarding + no public IP. The morpheus.callbrightside.com command center can still serve traffic via a load balancer with the VM as the backend. Section 15 walks through the migration.
1.10 Instance groups, autoscalers, and managed templates
For multi-VM patterns: a managed instance group (MIG) spawns identical VMs from an instance template, can autoscale on CPU, custom metric, or schedule, and integrates with backend services for load balancing. A regional MIG spreads VMs across all zones in a region for HA. We don't use MIGs today on nexus-vm but if BSP grows out of single-VM, the migration path is: snapshot the disk, build an instance template from the snapshot, define a MIG with size 1 first, then scale.
📝 Code, build an instance template from current nexus-vm
# Step 1: snapshot the boot disk
gcloud compute disks snapshot nexus-vm \
--zone=us-central1-a \
--snapshot-names=nexus-vm-template-$(date +%Y%m%d)
# Step 2: build a custom image
gcloud compute images create nexus-vm-image-v1 \
--source-snapshot=nexus-vm-template-$(date +%Y%m%d) \
--family=nexus-vm
# Step 3: create the template
gcloud compute instance-templates create nexus-vm-tpl-v1 \
--machine-type=n2-standard-2 \
--image-family=nexus-vm \
--image-project=PROJECT_ID \
--tags=http-server,https-server
1.11 gcloud compute reference (single-VM ops)
| Operation | Command |
|---|---|
| Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a |
| Stop / start | gcloud compute instances stop nexus-vm --zone=us-central1-a |
| Resize machine type | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 --zone=us-central1-a (VM must be stopped) |
| Resize boot disk | gcloud compute disks resize nexus-vm --size=100GB --zone=us-central1-a then resize2fs in the guest |
| Add a new disk | gcloud compute disks create data-1 --size=200GB --type=pd-balanced --zone=us-central1-a then gcloud compute instances attach-disk nexus-vm --disk=data-1 --zone=us-central1-a |
| Snapshot | gcloud compute disks snapshot nexus-vm --zone=us-central1-a --snapshot-names=nexus-vm-$(date +%Y%m%d) |
| Reset (hard) | gcloud compute instances reset nexus-vm --zone=us-central1-a (last resort) |
| Serial console | gcloud compute instances get-serial-port-output nexus-vm --zone=us-central1-a |
| Release a static IP | gcloud compute addresses delete IP_NAME --region=us-central1 (releases the address for good; to detach it from a VM without releasing it, remove the instance's access config instead) |
1.12 SVG: nexus-vm topology (diagram not reproduced in this text version)
✅ Production checklist, Compute Engine
- nexus-vm machine type sized to 95th percentile load + 25% headroom
- Boot disk type pd-balanced minimum, sized for required IOPS curve
- Daily snapshot schedule on the boot disk (Section 15.7)
- Static external IP reserved and named (so it survives stop/start)
- OS Login enabled OR ssh-keys metadata audited and pruned monthly
- Shielded VM features all on (secure boot, vTPM, integrity monitoring)
- Ops Agent installed and reporting (Section 6)
- Startup script idempotent and version-controlled
- onHostMaintenance=MIGRATE for production VMs
- No production workload on Local SSD without external replication
- Metadata access restricted, no SSRF surface in user-facing apps
🎓 FOR NEW HIRE, Compute Engine cheat lines
- "VM" = "instance" = "GCE VM" all mean the same thing.
- gcloud compute instances list shows you everything we run.
- To get to nexus-vm: gcloud compute ssh nexus-vm --zone=us-central1-a.
- Never run a destructive command (stop, delete, snapshot delete) without typing the VM name explicitly. The console autocompletes, that is dangerous.
- Python is the language we automate GCP from. The library is google-cloud-compute. Install with pip install google-cloud-compute and read the SDK docs.
🔒 2. IAM, Service Accounts, Audit HIGH PRIORITY
IAM (Identity and Access Management) is how Google decides whether an identity (a user, a service account, or a Google group) is allowed to perform an action on a resource. Mastering IAM is the difference between secure, predictable infrastructure and a 3 a.m. incident root cause that reads "the default service account had Owner."
2.1 The hierarchy: organization, folder, project, resource
Resources sit in a four-level hierarchy. IAM bindings attached at any level inherit downward.
- Organization, the root, tied to a Cloud Identity or Workspace domain (callbrightside.com).
- Folder, optional grouping (dev, prod, sandbox, by team).
- Project, the billing and quota boundary, where most resources actually live.
- Resource, e.g. a GCS bucket, a GCE instance, a Cloud SQL DB, sometimes accepts its own bindings.
A binding is a 3-tuple (member, role, condition?). Members are the identity, roles are bundles of permissions, conditions are optional CEL expressions that gate the binding by request attributes. Source: cloud.google.com/iam/docs/overview.
⚠️ Gotcha, inheritance is additive only
IAM bindings add permissions as you go down the tree, never subtract. If you grant Owner at the Org level, you cannot revoke it at the project level. The only way to remove an inherited permission is to remove the higher binding or use a Deny policy (Org Policy + IAM Deny, see 2.10).
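The additive-only rule is easy to internalize as a set union. A toy model only, the role names are real but the evaluation logic is deliberately simplified (no conditions, no deny policies):

```python
# Sketch: IAM Allow inheritance as a union down the resource hierarchy.
# Toy model: real evaluation also considers conditions and Deny policies.
HIERARCHY = ["org", "folder", "project", "resource"]

def effective_roles(bindings: dict, level: str) -> set:
    """Roles in effect at `level` = union of grants at that level and above.
    Nothing at a lower level can subtract from this set."""
    idx = HIERARCHY.index(level)
    granted: set = set()
    for lvl in HIERARCHY[: idx + 1]:
        granted |= bindings.get(lvl, set())
    return granted

bindings = {"org": {"roles/viewer"}, "project": {"roles/compute.osLogin"}}
print(effective_roles(bindings, "resource"))  # both roles: nothing subtracts
```

This is why an Org-level Owner grant is so dangerous: every project and resource below it inherits the full set, and no project-level change can take it back.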
2.2 Service accounts, the identity for automation
A service account (SA) is a Google identity owned by a project, used by software (not humans) to authenticate. Email format: NAME@PROJECT_ID.iam.gserviceaccount.com. Every project gets several service accounts created automatically.
| Service account | Email pattern | Purpose |
|---|---|---|
| Compute Engine default SA | PROJECT_NUMBER-compute@developer.gserviceaccount.com | Identity assumed by GCE VMs unless overridden |
| App Engine default SA | PROJECT_ID@appspot.gserviceaccount.com | App Engine and Cloud Functions Gen 1 default |
| Google APIs SA | PROJECT_NUMBER@cloudservices.gserviceaccount.com | Used by GCP services to act on your behalf, e.g. Deployment Manager |
| Cloud Build SA | PROJECT_NUMBER@cloudbuild.gserviceaccount.com | Cloud Build runs builds as this identity |
| Cloud Run SA | Compute default unless overridden | Service identity for Cloud Run revisions |
| Pub/Sub SA | service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com | Pub/Sub uses this for push delivery |
🔥 Recency, default SA permissions tightened
Before May 2024, the Compute Engine default SA was granted Editor (roles/editor) on the project at creation. Organizations created since then enforce the org policy iam.automaticIamGrantsForDefaultServiceAccounts by default, so new projects no longer auto-grant Editor; older orgs should enforce it themselves. Verify with gcloud projects get-iam-policy PROJECT_ID. Source: cloud.google.com/iam/docs/service-account-overview.
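The verification step can be scripted against the JSON form of the policy (gcloud projects get-iam-policy PROJECT_ID --format=json). A minimal sketch; the sample policy below is fabricated:

```python
# Sketch: flag the Compute default SA if it still holds roles/editor.
# Input is the policy dict from `gcloud projects get-iam-policy --format=json`.
def default_sa_has_editor(policy: dict, project_number: str) -> bool:
    sa = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    for binding in policy.get("bindings", []):
        if binding.get("role") == "roles/editor" and sa in binding.get("members", []):
            return True
    return False

policy = {"bindings": [{"role": "roles/editor",
                        "members": ["serviceAccount:123456-compute@developer.gserviceaccount.com"]}]}
print(default_sa_has_editor(policy, "123456"))  # True: this project needs attention
```

Drop this into the quarterly review automation and the "default SA has Editor" finding stops depending on someone remembering to look.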
2.3 Service account keys, the danger zone
You can mint a downloadable JSON key for a service account. This key is a long-lived bearer credential. If it leaks, the holder authenticates as the SA from anywhere on the internet until you rotate the key. Rules of the road:
- Avoid SA keys for any workload that can use Application Default Credentials, Workload Identity Federation, or attached SA on a GCE VM/Cloud Run service.
- If you must mint one, set an expiry, store it in Secret Manager, and rotate every 90 days.
- Org policy iam.disableServiceAccountKeyCreation blocks key creation entirely.
- Org policy iam.disableServiceAccountKeyUpload blocks BYO key uploads.
📝 Code, list and rotate SA keys
# List keys for an SA
gcloud iam service-accounts keys list \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Create a new key (note: this command sets no expiry; enforce one via the
# org policy constraints/iam.serviceAccountKeyExpiryHours)
gcloud iam service-accounts keys create new-key.json \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Disable an old key (preferred over delete during rotation)
gcloud iam service-accounts keys disable KEY_ID \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Delete after the rollout is verified
gcloud iam service-accounts keys delete KEY_ID \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
2.4 Workload Identity Federation (the right answer)
WIF lets a workload outside GCP (GitHub Actions, AWS, Okta, anything that issues an OIDC or SAML token) impersonate a Google service account without ever touching a downloaded key. You configure a workload identity pool, define provider trust (issuer URL, audience, attribute mapping), and grant the external identity roles/iam.workloadIdentityUser on the target SA.
📝 Code, GitHub Actions to GCP without a key
# 1. Create the workload identity pool
gcloud iam workload-identity-pools create gh-pool \
--location=global \
--display-name="GitHub Actions"
# 2. Add the GitHub OIDC provider
gcloud iam workload-identity-pools providers create-oidc gh-provider \
--workload-identity-pool=gh-pool --location=global \
--issuer-uri="https://token.actions.githubusercontent.com" \
--attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
--attribute-condition="assertion.repository=='callbrightside/nexus'"
# 3. Grant the SA's WorkloadIdentityUser role to the GitHub repo subject
gcloud iam service-accounts add-iam-policy-binding \
ci-deployer@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/iam.workloadIdentityUser \
--member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/attribute.repository/callbrightside/nexus"
2.5 Role taxonomy: basic, predefined, custom
| Tier | Examples | Use |
|---|---|---|
| Basic (legacy) | roles/owner, roles/editor, roles/viewer | Avoid in production. Far too broad. |
| Predefined | roles/compute.instanceAdmin, roles/storage.objectViewer, roles/secretmanager.secretAccessor | Standard answer. Use these. |
| Custom | You define a list of permissions | For least-privilege when no predefined role fits |
📝 Code, build a custom role for the nexus runner
# nexus-runner.yaml
title: "Nexus Runner"
description: "Read GCS, write logs, no admin"
stage: GA
includedPermissions:
- storage.objects.get
- storage.objects.list
- logging.logEntries.create
- secretmanager.versions.access
gcloud iam roles create nexusRunner \
--project=PROJECT_ID \
--file=nexus-runner.yaml
2.6 Permissions you actually need on nexus-vm
| Action | Required permission(s) | Predefined role |
|---|---|---|
| SSH to nexus-vm via gcloud | compute.instances.get + iap.tunnelInstances.accessViaIAP | roles/compute.osLogin + roles/iap.tunnelResourceAccessor |
| Stop/start nexus-vm | compute.instances.stop, compute.instances.start | roles/compute.instanceAdmin.v1 |
| Read a GCS bucket | storage.objects.get, storage.objects.list | roles/storage.objectViewer |
| Read a Secret Manager value | secretmanager.versions.access | roles/secretmanager.secretAccessor |
| Write log entries | logging.logEntries.create | roles/logging.logWriter |
| Write metrics | monitoring.timeSeries.create | roles/monitoring.metricWriter |
| Snapshot a disk | compute.disks.createSnapshot | roles/compute.storageAdmin |
2.7 Conditional bindings (CEL)
You can attach a Common Expression Language condition to any binding. The binding only fires when the request matches.
📝 Code, restrict GCS read access to a specific bucket and time window
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:auditor@callbrightside.com" \
--role="roles/storage.objectViewer" \
--condition='expression=resource.name.startsWith("projects/_/buckets/bsp-audit") && request.time < timestamp("2026-12-31T23:59:59Z"),title=audit-window-2026'
2.8 Domain-wide delegation
For Workspace customers, a service account can be granted the right to impersonate any user in the domain for specific OAuth scopes. Used by automation that needs to send mail as a user, read calendars across the org, etc. Configured in admin.google.com under Security → API Controls → Domain-wide Delegation. The SA's "client ID" (numeric, not the email) is whitelisted with a list of scopes. Powerful, dangerous, audit it.
2.9 Audit logs taxonomy
Cloud Audit Logs come in four streams:
- Admin Activity, always on, free, 400-day retention. Captures any API call that modifies config or metadata. Source of truth for "who turned off the firewall."
- Data Access, off by default for most services (BigQuery is the exception), can be expensive, captures read/write of data. Enable selectively per service.
- System Event, generated by Google, captures auto-actions like live migrations.
- Policy Denied, captures denied requests so you can debug missing IAM.
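The four streams are distinguishable by the logName suffix on each entry (the log ID keeps the URL-encoded %2F). A small convenience sketch; the suffix-to-stream mapping follows the stream names above:

```python
# Sketch: classify an audit log entry by its logName suffix.
# The log IDs are the documented stream names; the mapping is a convenience.
STREAMS = {
    "%2Factivity": "Admin Activity",
    "%2Fdata_access": "Data Access",
    "%2Fsystem_event": "System Event",
    "%2Fpolicy": "Policy Denied",
}

def audit_stream(log_name: str) -> str:
    for suffix, stream in STREAMS.items():
        if log_name.endswith(suffix):
            return stream
    return "unknown"

print(audit_stream("projects/p/logs/cloudaudit.googleapis.com%2Factivity"))  # Admin Activity
```

Useful when post-processing gcloud logging read output: group entries by stream first, then drill into the always-on Admin Activity stream for "who changed what."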
📝 Code, query audit logs for IAM changes
gcloud logging read \
'logName=~"cloudaudit.googleapis.com%2Factivity" AND protoPayload.serviceName="iam.googleapis.com"' \
--limit=20 --format="table(timestamp,protoPayload.authenticationInfo.principalEmail,protoPayload.methodName,protoPayload.resourceName)"
2.10 IAM Deny policies
2023+ feature. A Deny policy is the only way to prevent a permission regardless of inherited Allow bindings. Attaches at Org, Folder, or Project level. Useful for guardrails like "no one, not even Org Admins, can disable Audit Logging."
📝 Code, deny a specific permission for everyone except a break-glass group
# deny-policy.json
{
"displayName": "deny-disable-audit-logging",
"rules": [{
"deniedPrincipals": ["principalSet://goog/public:all"],
"exceptionPrincipals": ["principalSet://goog/group/break-glass@callbrightside.com"],
"deniedPermissions": ["logging.googleapis.com/sinks.delete"]
}]
}
gcloud iam policies create deny-disable-audit \
--kind=denypolicies \
--policy-file=deny-policy.json \
--attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID
2.11 Troubleshooter, Policy Analyzer, Policy Simulator
- IAM Troubleshooter, "why can user X not do action Y on resource Z" wizard. Console → IAM → Troubleshoot.
- Policy Analyzer, run queries like "list all bindings that grant any permission on resource Z." Useful for compliance audits.
- Policy Simulator, simulate a policy change before applying it. Tells you which historical requests would have been allowed/denied differently.
✅ Production checklist, IAM
- No human user has roles/owner or roles/editor on the production project
- Default Compute SA does not have Editor (verify org policy)
- SA keys forbidden by org policy or rotated <90 days
- External CI uses Workload Identity Federation, not downloaded keys
- All bindings reviewed quarterly with Policy Analyzer
- Data Access logs on for high-value services (Secret Manager, GCS audit bucket)
- Deny policy guarding Audit Logs and IAM-modifying permissions
- Custom roles preferred over basic roles for least privilege
🎓 FOR NEW HIRE, IAM mental model
Every API call to GCP gets stamped "who is asking, what are they trying to do, on what resource." IAM is the lookup table that returns yes or no. The fastest debugging path when you get a 403: copy the exact permission string from the error, search the role catalog (cloud.google.com/iam/docs/understanding-roles), and grant the smallest predefined role that contains it. Never grant Owner to fix something. Once you do, you cannot tell what permission was actually missing.
🌐 3. Networking, VPC, Load Balancing HIGH PRIORITY
Every byte that reaches nexus-vm traverses a chain of network primitives. Understanding the chain is what turns "the site is down" from a 30-minute incident into a 2-minute fix.
3.1 VPC architecture
A VPC (Virtual Private Cloud) is a global, software-defined network. Unlike AWS, where each VPC is regional, Google's VPC spans every region. A subnet, however, is regional. So a single VPC named default typically has one auto-mode subnet per region, each with a non-overlapping CIDR.
VPCs come in two modes:
- Auto mode, Google manages a /20 subnet per region in the 10.128.0.0/9 range. Convenient for sandboxes.
- Custom mode, you create subnets explicitly with chosen CIDRs. Required for any production workload.
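The range math above is easy to check offline before you touch gcloud. A minimal sketch with Python's standard ipaddress module (the candidate CIDRs are illustrative, not our live plan): confirm an auto-mode /20 nests inside 10.128.0.0/9, and reject a custom subnet that would collide with an existing range.

📝 Code, sanity-check subnet CIDRs before creating them

```python
import ipaddress

# Auto-mode subnets are /20s carved out of the 10.128.0.0/9 super-range.
auto_super = ipaddress.ip_network("10.128.0.0/9")
us_central1_default = ipaddress.ip_network("10.128.0.0/20")
assert us_central1_default.subnet_of(auto_super)

def overlaps(existing: list[str], candidate: str) -> bool:
    """True if the candidate CIDR collides with any existing subnet range."""
    cand = ipaddress.ip_network(candidate)
    return any(cand.overlaps(ipaddress.ip_network(e)) for e in existing)

# Planning a second subnet next to nexus-subnet's 10.10.0.0/24:
print(overlaps(["10.10.0.0/24"], "10.10.0.128/25"))  # True, collides
print(overlaps(["10.10.0.0/24"], "10.10.1.0/24"))    # False, safe to create
```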
📝 Code, create a custom-mode VPC and a subnet
gcloud compute networks create bsp-prod-vpc --subnet-mode=custom
gcloud compute networks subnets create nexus-subnet \
--network=bsp-prod-vpc \
--region=us-central1 \
--range=10.10.0.0/24 \
--enable-private-ip-google-access \
--enable-flow-logs
3.2 Subnets, secondary ranges, alias IPs
A subnet has a primary IPv4 range used for VM NICs, plus optional secondary ranges used for GKE pod and service IPs (alias IP), or for assigning a /28 to a Cloud SQL Private Service Connection. Source: cloud.google.com/vpc/docs/subnets.
3.3 Firewall rules
VPC firewalls are stateful: return traffic for an allowed connection is automatically permitted, so a connection initiated from inside the VPC gets its responses back without a separate ingress allow rule. Rules are scoped to a network and evaluated in priority order (lowest number wins; on a priority tie, deny beats allow).
| Field | Meaning |
|---|---|
| Direction | INGRESS or EGRESS |
| Action | ALLOW or DENY |
| Priority | 0-65535, lower = higher priority. Default 1000. |
| Source ranges | List of CIDRs (ingress only) |
| Source tags / SAs | Restrict to VMs with a network tag or running as an SA |
| Target tags / SAs | Apply only to VMs with a tag or SA |
| Protocols/ports | e.g. tcp:80,443, udp:53, all |
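To internalize the evaluation order, here is a toy evaluator. This is a sketch for intuition only; real VPC firewall matching also considers direction, source ranges, and target tags/SAs, which this deliberately skips.

📝 Code, toy model of firewall rule evaluation

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    priority: int   # 0-65535, lower number wins
    action: str     # "ALLOW" or "DENY"
    ports: set[int]

def evaluate(rules: list[Rule], port: int) -> str:
    """Lowest priority number wins; on a tie, DENY beats ALLOW.
    No match falls through to the implied ingress deny."""
    matching = [r for r in rules if port in r.ports]
    if not matching:
        return "DENY (implied)"
    best = min(r.priority for r in matching)
    tied = [r for r in matching if r.priority == best]
    return "DENY" if any(r.action == "DENY" for r in tied) else "ALLOW"

rules = [
    Rule("allow-https-cf", 1000, "ALLOW", {443}),
    Rule("deny-all-ingress", 65534, "DENY", set(range(1, 65536))),
]
print(evaluate(rules, 443))   # ALLOW, priority 1000 beats 65534
print(evaluate(rules, 8080))  # DENY, only the catch-all matches
```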
📝 Code, the BSP standard nexus-vm firewall posture
# SSH only from IAP
gcloud compute firewall-rules create allow-ssh-iap \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20 \
--target-tags=ssh-iap
# HTTPS from Cloudflare only
gcloud compute firewall-rules create allow-https-cf \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:443 --source-ranges=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22 \
--target-tags=web
# Default deny all other ingress (priority 65534, hardened)
gcloud compute firewall-rules create deny-all-ingress \
--network=bsp-prod-vpc --direction=INGRESS --action=DENY \
--rules=all --source-ranges=0.0.0.0/0 --priority=65534
⚠️ Gotcha, default network is too permissive
Auto-created default network ships with default-allow-ssh, default-allow-rdp, and default-allow-icmp open to 0.0.0.0/0, plus default-allow-internal open across the entire auto-mode range. Production should use a custom VPC with no default rules. If you must keep default, delete default-allow-ssh and default-allow-rdp immediately.
⚠️ Gotcha, network tags vs target service accounts
Network tags are unauthenticated metadata, anyone with compute.instances.setTags can add the web tag to any VM and inherit its firewall rules. Target service accounts require iam.serviceAccounts.actAs and are the secure default for production. Migrate from tags to SAs in the firewall rules.
3.4 Cloud NAT, Private Google Access, Private Service Connect
- Cloud NAT, gives outbound internet to VMs without external IPs. Regional, scaled by NAT IPs you provision (start with auto, watch for port exhaustion at >~64,000 outbound conns per IP).
- Private Google Access, lets a VM with no external IP reach googleapis.com endpoints (storage, logging, monitoring, secrets) over Google's backbone. Enable at the subnet level. Free.
- Private Service Connect (PSC), exposes a managed service (Cloud SQL, third-party SaaS) at a private IP inside your VPC. Hides the public endpoint entirely.
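The Cloud NAT port-exhaustion math is worth having at your fingertips. A back-of-envelope sketch, assuming the documented 64,512 usable ports per NAT IP (the first 1,024 are reserved) and Cloud NAT's default minimum of 64 ports per VM; verify both numbers against the current NAT docs.

📝 Code, Cloud NAT capacity back-of-envelope

```python
USABLE_PORTS_PER_NAT_IP = 64512   # 65536 minus the 1024 reserved ports
DEFAULT_MIN_PORTS_PER_VM = 64     # Cloud NAT default minPortsPerVm

def vms_supported(nat_ips: int, ports_per_vm: int = DEFAULT_MIN_PORTS_PER_VM) -> int:
    """How many VMs a NAT config can serve before port allocation fails."""
    return (nat_ips * USABLE_PORTS_PER_NAT_IP) // ports_per_vm

print(vms_supported(1))        # 1008 VMs on one NAT IP at the default
print(vms_supported(2, 1024))  # 126 VMs if each needs 1024 concurrent conns
```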
3.5 Cloud Load Balancing variants
| Type | Layer | Scope | Use |
|---|---|---|---|
| Global external Application LB | L7 HTTPS | Global anycast IP | Public web apps, multi-region failover |
| Regional external Application LB | L7 HTTPS | Regional | Single-region web with custom auth |
| Internal Application LB | L7 HTTPS | Regional, VPC-internal | Microservices inside the VPC |
| Global external Network LB | L4 TCP/UDP/SSL | Global anycast IP | Non-HTTP global, e.g. game servers |
| Regional external Network LB | L4 TCP/UDP | Regional | Pass-through, preserves source IP |
| Internal Network LB | L4 TCP/UDP | Regional, VPC-internal | Internal pass-through |
3.6 Cloud Armor
WAF for the global Application Load Balancer. Features: pre-configured OWASP rules, rate-limiting (per-IP per-minute thresholds), bot management, geo-based allow/deny (GeoIP), Adaptive Protection (ML-based DDoS), reCAPTCHA Enterprise integration. Enabled per backend service. The Cloudflare in front of nexus-vm handles much of this today, but if we move to a GCP load balancer, Cloud Armor takes over.
3.7 Cloud CDN
Edge caching tied to the global Application LB. Set --enable-cdn on a backend service. Cache keys default to host + path, customizable. Negative caching for 404s. Cache invalidation via gcloud compute url-maps invalidate-cdn-cache. Today Cloudflare is our CDN, Cloud CDN is the migration target if we leave Cloudflare.
3.8 Cloud DNS
Authoritative managed DNS. Two zone types: public (resolvable on the internet) and private (resolvable only inside designated VPCs). DNSSEC available, DNS forwarding for hybrid (on-prem to GCP). Today callbrightside.com DNS is on Cloudflare; Cloud DNS is the option if we centralize on Google.
3.9 VPC peering, Shared VPC, VPC-SC
- VPC Peering, point-to-point, non-transitive connection between two VPCs. CIDRs must not overlap. No bandwidth limit.
- Shared VPC, one host project owns the network, multiple service projects attach VMs. Centralizes networking ops.
- VPC Service Controls (VPC-SC), defines a security perimeter around services like GCS, BigQuery, Secret Manager. Even if an SA key leaks, it cannot exfiltrate data outside the perimeter.
3.10 Interconnect, VPN, Network Connectivity Center
Hybrid cloud options:
- Cloud VPN, IPsec tunnels, classic and HA flavors, ~3 Gbps per tunnel.
- Dedicated Interconnect, 10 Gbps or 100 Gbps physical circuit, requires a colocation provider.
- Partner Interconnect, 50 Mbps to 50 Gbps via a service provider.
- Cross-Cloud Interconnect, dedicated link to AWS, Azure, OCI, Alibaba.
- Network Connectivity Center (NCC), hub-and-spoke management for complex multi-VPC and hybrid topologies.
3.11 IAP for HTTPS
Beyond TCP forwarding (Section 1.9), IAP can sit in front of an HTTPS Load Balancer to add Google identity authentication on top of any backend (GCE, GKE, Cloud Run, App Engine). Set --enable-iap on the backend, the LB returns a Google sign-in flow before the request reaches the backend. The signed JWT is forwarded as X-Goog-IAP-JWT-Assertion for the backend to verify.
3.12 Static and ephemeral IPs, IP forwarding
- Ephemeral external IP, attached at instance create time, lost on instance delete (or stop, if not promoted).
- Static external IP, regional resource you reserve and attach. Survives instance lifecycle. Verify nexus-vm uses one.
- IP forwarding, allow a VM to act as a NAT/router by setting canIpForward=true. Required for software-defined gateways.
📝 Code, verify nexus-vm has a static IP
gcloud compute addresses list --filter="address=34.55.179.122"
# If empty, the IP is ephemeral. Promote it:
gcloud compute addresses create nexus-vm-static \
--addresses=34.55.179.122 \
--region=us-central1
3.13 Network telemetry: VPC Flow Logs, Mirror, Intelligence Center
- VPC Flow Logs, sampled connection records exported to Cloud Logging. Enable per-subnet. Cost scales with sample rate.
- Packet Mirror, full pcap of selected traffic for forensics or IDS appliances.
- Network Intelligence Center, dashboards for topology, connectivity tests, performance, firewall insights, and routes.
✅ Production checklist, Networking
- Custom-mode VPC, no default network in production
- Default-deny ingress firewall at priority 65534
- Service-account-based firewall targets (not network tags) for production
- Cloud NAT for outbound from any VM without an external IP
- Private Google Access on every subnet hosting workload VMs
- VPC Flow Logs on with reasonable sampling (5-10%)
- Static external IP reserved for any user-facing endpoint
- Cloud Armor (or Cloudflare equivalent) WAF in front of all public endpoints
- DNS records have TTL <=300s for any IP that might change
- Connectivity Tests run after every firewall change
🎓 FOR NEW HIRE, Networking field guide
"VPC" = software-defined network. "Subnet" = the per-region range of IPs. "Firewall rule" = which ports/sources can reach which VMs. "Load balancer" = the front door if we have more than one VM. Use gcloud compute networks list, gcloud compute networks subnets list, gcloud compute firewall-rules list to see the current state. When something is unreachable, the order of debugging is: (1) DNS, (2) Cloudflare, (3) firewall rule, (4) the VM's own iptables, (5) the application listening on the port. The Network Intelligence Center connectivity test does the first three for you.
💾 4. Storage, GCS, Cloud SQL HIGH PRIORITY
Storage on GCP comes in three flavors that matter for nexus-vm: object storage (GCS), block storage (PD/Hyperdisk attached to the VM), and managed databases (Cloud SQL). This section covers all three plus Filestore for shared files.
4.1 GCS storage classes
| Class | Min duration | Storage $/GB-mo | Retrieval $/GB | Use |
|---|---|---|---|---|
| Standard | None | ~$0.020 | $0 | Hot, frequent access |
| Nearline | 30 days | ~$0.010 | $0.01 | Monthly backups |
| Coldline | 90 days | ~$0.004 | $0.02 | Quarterly backups |
| Archive | 365 days | ~$0.0012 | $0.05 | Compliance, long-term |
Min duration means you pay as if the object lived that long even if you delete earlier. All classes have the same low first-byte latency, the difference is purely cost-vs-retention. Multi-regional and dual-regional buckets cost slightly more for higher availability. Source: cloud.google.com/storage/pricing.
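The min-duration billing rule can be made concrete. A hedged sketch using the approximate per-GB prices from the table above; check cloud.google.com/storage/pricing for current numbers before budgeting.

📝 Code, early-deletion cost per storage class

```python
# $/GB-month figures from the table above, approximate list prices.
CLASSES = {
    "STANDARD": {"price": 0.020, "min_days": 0},
    "NEARLINE": {"price": 0.010, "min_days": 30},
    "COLDLINE": {"price": 0.004, "min_days": 90},
    "ARCHIVE":  {"price": 0.0012, "min_days": 365},
}

def storage_cost(cls: str, gb: float, days_stored: int) -> float:
    """Billable cost: you pay for at least min_days even if you delete sooner."""
    c = CLASSES[cls]
    billable_days = max(days_stored, c["min_days"])
    return round(gb * c["price"] * billable_days / 30, 4)

# Deleting a Coldline object after 10 days still bills the full 90 days.
print(storage_cost("COLDLINE", 100, 10))  # identical to storing 90 days
print(storage_cost("COLDLINE", 100, 90))
```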
4.2 Lifecycle, versioning, retention, bucket lock
Lifecycle rules transition or delete objects based on age, version count, or class.
📝 Code, lifecycle for nexus-vm snapshots in GCS
# lifecycle.json
{
"lifecycle": {
"rule": [
{"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
"condition": {"age": 30}},
{"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
"condition": {"age": 90}},
{"action": {"type": "Delete"},
"condition": {"age": 365}}
]
}
}
gcloud storage buckets update gs://bsp-backups \
--lifecycle-file=lifecycle.json
Object versioning keeps prior versions when you overwrite or delete. Retention policy enforces a minimum age before deletion. Bucket Lock makes a retention policy permanent (cannot be removed, only extended). Use Bucket Lock for compliance buckets.
4.3 IAM vs ACLs on buckets
Two access control systems coexist:
- Uniform bucket-level access (UBLA), recommended, IAM only.
- Fine-grained, legacy, IAM + per-object ACLs. Hard to audit.
Set UBLA at bucket creation: gcloud storage buckets create gs://bsp-backups --uniform-bucket-level-access. The Object ACL system is essentially deprecated for new buckets. Source: cloud.google.com/storage/docs/uniform-bucket-level-access.
4.4 Signed URLs, signed policies, CORS
A signed URL grants time-bounded access to a single object using a service account's private key. CORS lets browser-based JS upload/download from a bucket.
📝 Code, signed URL in Python
from google.cloud import storage
from datetime import timedelta
client = storage.Client()
bucket = client.bucket("bsp-uploads")
blob = bucket.blob("reports/q2-2026.pdf")
url = blob.generate_signed_url(
version="v4",
expiration=timedelta(minutes=15),
method="GET",
)
print(url)
⚠️ Gotcha, signed URLs require a key or signBlob permission
To sign with a service account on a GCE VM, the VM's SA needs iam.serviceAccounts.signBlob on itself. Otherwise you get an opaque error about no private key. Grant roles/iam.serviceAccountTokenCreator on the SA to the SA itself.
4.5 Cloud SQL configurations
| Engine | Versions | Notes |
|---|---|---|
| MySQL | 5.7, 8.0, 8.4 | 5.7 EOL approaching |
| PostgreSQL | 12, 13, 14, 15, 16, 17 | 11 and earlier past EOL |
| SQL Server | 2017, 2019, 2022 (Std, Enterprise, Web, Express) | License included or BYOL |
Tiers: shared-core (db-f1-micro, db-g1-small, retired in 2024+ for some versions), custom (1-96 vCPU, 0.9-624 GB RAM), high-memory presets. Storage: SSD or HDD, autogrow available.
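Custom tiers are expressed as machine type strings of the form db-custom-VCPU-MEMORY_MB, e.g. db-custom-4-15360 for 4 vCPU and 15 GB. A small helper sketch; the per-vCPU memory limits encoded below are from the docs as remembered (0.9 to 6.5 GB per vCPU, multiple of 256 MB), verify before relying on them.

📝 Code, build a Cloud SQL custom tier string

```python
def custom_tier(vcpus: int, memory_mb: int) -> str:
    """Build a Cloud SQL custom machine type like db-custom-4-15360.
    Constraint checks reflect documented limits (verify against current
    docs): memory a multiple of 256 MB, and 0.9-6.5 GB per vCPU."""
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    per_vcpu = memory_mb / vcpus
    if not (921.6 <= per_vcpu <= 6656):
        raise ValueError("memory must be 0.9-6.5 GB per vCPU")
    return f"db-custom-{vcpus}-{memory_mb}"

print(custom_tier(4, 15360))  # db-custom-4-15360
```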
4.6 HA, replicas, PITR, backups, maintenance
- High Availability (HA), regional configuration with synchronous standby in a different zone. Failover RTO ~60 seconds, RPO 0. ~2x cost.
- Read replicas, async, in same or different region. For read scaling and DR. Can be promoted to standalone.
- Point-in-time recovery (PITR), requires write-ahead logs enabled, can restore to any second within the retention window (default 7 days, up to 35).
- Automated backups, daily, configurable window. Multi-regional location.
- Maintenance window, weekly, minor version updates. Set to off-hours on Saturday morning.
4.7 Cloud SQL Auth Proxy and IAM auth
The Auth Proxy is a small binary (Go) that establishes a TLS-encrypted tunnel from your application to Cloud SQL using your Google credentials, no password needed for the client side and no firewall rules to manage. IAM authentication lets a Google identity log into Postgres/MySQL with a short-lived token.
📝 Code, run the Cloud SQL Auth Proxy on nexus-vm
# Download (Linux amd64)
curl -o cloud-sql-proxy \
https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.linux.amd64
chmod +x cloud-sql-proxy
# Connect to a Postgres instance via Unix socket
./cloud-sql-proxy PROJECT_ID:us-central1:bsp-pg --unix-socket=/cloudsql
# In your app, connect to host=/cloudsql/PROJECT_ID:us-central1:bsp-pg
4.8 PD vs Filestore vs GCS, when to use which
| Need | Pick | Why |
|---|---|---|
| Boot disk for nexus-vm | pd-balanced | Default, sized to needed IOPS |
| Database files | pd-ssd or hyperdisk-balanced | Predictable latency |
| Shared files across multiple VMs | Filestore (NFS) | POSIX semantics |
| Daily backups, snapshots, large objects | GCS | Cheap, durable, lifecycle |
| Logs, ML training data, public assets | GCS | Throughput scales horizontally |
| Scratch / cache | Local SSD | Lowest latency, ephemeral |
4.9 Filestore tiers
| Tier | Min capacity | Throughput | Use |
|---|---|---|---|
| Basic HDD | 1 TB | 100 MB/s/TB | Sequential, low cost |
| Basic SSD | 2.5 TB | 1.2 GB/s | Mixed workloads |
| Zonal (Enterprise) | 1 TB | scales linearly | SLA-backed, single zone |
| Regional | 1 TB | scales linearly | HA across zones |
| Enterprise (legacy) | 1 TB | scales linearly | Replaced by Regional |
4.10 Backup and DR strategies for nexus-vm
- Daily PD snapshots on the boot disk (Section 15.7).
- Weekly export of critical data to GCS (regional bucket with versioning).
- Monthly transition to Nearline/Coldline via lifecycle.
- Quarterly DR drill: rebuild a VM from the latest snapshot in a different zone.
✅ Production checklist, Storage
- UBLA on every bucket, no fine-grained ACLs in production
- Versioning + lifecycle on backup buckets
- Bucket Lock on compliance buckets
- Daily snapshot policy on the nexus-vm boot disk
- Weekly export of /opt/nexus to GCS regional bucket
- Cloud SQL HA enabled in production, automated backups, PITR retention >=14 days
- Cloud SQL Auth Proxy used in place of public-IP + password auth
- Signed URLs default expiry <=15 minutes for sensitive content
- CORS rules limited to known origins, not *
- Quarterly DR drill from snapshot to a fresh VM
🎓 FOR NEW HIRE, Storage in one paragraph
GCS holds anything that is not actively being read by a database (assets, backups, logs, ML data). Disks live attached to a VM and hold the OS and database. Cloud SQL is a managed MySQL/Postgres/SQL Server. Pick GCS by default, pick a Disk only when you need POSIX file semantics on a single VM, pick Cloud SQL when you need ACID transactions and don't want to operate Postgres yourself. The Python SDK for GCS is google-cloud-storage, install with pip install google-cloud-storage and the docs are at cloud.google.com/python/docs/reference/storage.
🔒 5. Secret Manager, Cloud KMS MEDIUM PRIORITY
Secret Manager holds the application secrets that nexus-vm needs (Anthropic API key, Cloudflare API token, BRICKS_WP_APP_PASSWORD, Vapi API key). Cloud KMS holds the encryption keys that protect everything else. Different tools, different jobs, often confused.
5.1 Secret Manager: model and versioning
A secret is a named container. Each secret has multiple versions (1, 2, 3...), only one of which is the latest at any time. Versions are immutable. To rotate a secret, add a new version, point your app at latest or pin to a specific version. Each access is auditable.
📝 Code, the standard Secret Manager workflow
# Create a secret
gcloud secrets create BRICKS_WP_APP_PASSWORD \
--replication-policy=automatic
# Add a version (rotation)
echo -n "new_app_password_value" | \
gcloud secrets versions add BRICKS_WP_APP_PASSWORD --data-file=-
# Read the latest version (from inside nexus-vm)
gcloud secrets versions access latest --secret=BRICKS_WP_APP_PASSWORD
# Disable an old version (preferred during rollout)
gcloud secrets versions disable 3 --secret=BRICKS_WP_APP_PASSWORD
# Destroy after the rollout is verified (irreversible)
gcloud secrets versions destroy 3 --secret=BRICKS_WP_APP_PASSWORD
5.2 Replication and CMEK
- Automatic, Google replicates the secret across multiple regions in your jurisdiction. Default. Highest availability.
- User-managed, you pick the regions, useful for data residency.
- CMEK, encrypt the secret at rest with your own KMS key, required for some compliance regimes.
5.3 Rotation and notifications
Secret Manager has built-in rotation scheduling. You set a --next-rotation-time and --rotation-period, and Secret Manager publishes a Pub/Sub message at the scheduled time. Your rotation handler (Cloud Function, Cloud Run job, etc.) creates a new version. Source: cloud.google.com/secret-manager/docs/rotation-recommendations.
5.4 Cloud KMS: model
Hierarchy: key ring → key → key version. Key rings are regional (or global, or multi-regional). A key has a purpose (symmetric encryption, asymmetric signing, asymmetric decryption, MAC), an algorithm, and a protection level (software or HSM). Each version is the actual cryptographic material. KMS never returns the raw key, you call encrypt, decrypt, sign, or verify.
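Every KMS call addresses a key version by its full resource name, which encodes the whole hierarchy. A trivial helper that makes the nesting explicit (project and key names here are placeholders):

📝 Code, KMS resource names mirror the hierarchy

```python
def kms_key_version(project: str, location: str, ring: str,
                    key: str, version: int) -> str:
    """Build the full resource name that encrypt/decrypt/sign calls expect.
    Mirrors the hierarchy: key ring -> key -> key version."""
    return (f"projects/{project}/locations/{location}"
            f"/keyRings/{ring}/cryptoKeys/{key}/cryptoKeyVersions/{version}")

print(kms_key_version("bsp-prod", "us-central1", "bsp-prod", "gcs-cmek-1", 1))
```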
5.5 CMEK on Compute, GCS, Cloud SQL
Customer-Managed Encryption Keys override the default Google-managed encryption with a key in your KMS. Apply at resource create time:
📝 Code, create a GCS bucket with CMEK
gcloud storage buckets create gs://bsp-cmek-test \
--default-encryption-key=projects/PROJECT_ID/locations/us-central1/keyRings/bsp-prod/cryptoKeys/gcs-cmek-1
# Grant GCS service agent permission to use the key
gcloud kms keys add-iam-policy-binding gcs-cmek-1 \
--keyring=bsp-prod --location=us-central1 \
--member=serviceAccount:service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
--role=roles/cloudkms.cryptoKeyEncrypterDecrypter
5.6 CSEK (legacy) and external HSM
Customer-Supplied Encryption Keys let you provide raw bytes per request. Mostly deprecated in favor of CMEK. External HSM via Cloud HSM (managed) or Cloud External Key Manager (your HSM at a partner like Equinix). Skip unless mandated by compliance.
5.7 Secret Manager into nexus-vm
📝 Code, fetch a secret from Python on nexus-vm
from google.cloud import secretmanager
def get_secret(name: str) -> str:
client = secretmanager.SecretManagerServiceClient()
project = "bsp-prod"
path = f"projects/{project}/secrets/{name}/versions/latest"
response = client.access_secret_version(request={"name": path})
return response.payload.data.decode("utf-8")
bricks_pwd = get_secret("BRICKS_WP_APP_PASSWORD")
cf_token = get_secret("CLOUDFLARE_API_TOKEN")
⚠️ Gotcha, never log secret values
Set up a logging filter that drops any field containing BRICKS_WP_APP_PASSWORD, CLOUDFLARE_API_TOKEN, ANTHROPIC_API_KEY. The fastest way to leak a secret is to print(get_secret(...)) during a debug session and forget. Secret Manager itself only audits "this secret was accessed", which still leaves you to track down where it went.
✅ Production checklist, Secrets and KMS
- Every API key, password, and credential lives in Secret Manager
- Rotation period set on every secret (90 days max for API keys)
- nexus-vm SA has roles/secretmanager.secretAccessor on each secret it needs, scoped not project-wide
- CMEK on the production GCS backup bucket and Cloud SQL instance
- Key rings regional (us-central1) matching the workload
- Audit logs (Data Access) on for Secret Manager and KMS
- Pub/Sub rotation handler tested at least once per quarter
🎓 FOR NEW HIRE, Secret Manager rule
If a value would let someone act as you, it goes in Secret Manager. Period. No .env files committed to git, no values pasted in Slack, no config.py with constants. The nexus-vm service account gets read access at runtime and Secret Manager logs every access for the audit trail.
📊 6. Observability, Logging, Monitoring HIGH PRIORITY
If nexus-vm goes sideways, observability is how you find out before Robert does. Cloud Logging, Cloud Monitoring, Cloud Trace, Profiler, and Error Reporting are five products under the umbrella name "Cloud Operations Suite" (formerly Stackdriver).
6.1 Cloud Logging architecture
Every log entry is a structured JSON document with a timestamp, a log name, a severity, a resource label set, and a payload. Entries flow into buckets (storage with retention), filtered by sinks (which entries go to which bucket or destination). The _Default bucket holds 30 days, the _Required bucket holds Admin Activity audit logs for 400 days, both free.
| Concept | What it is |
|---|---|
| Log entry | One row, structured fields, payload |
| Log name | Logical stream, e.g. cloudaudit.googleapis.com%2Factivity |
| Bucket | Storage location with retention policy |
| Sink | Filter expression + destination (bucket, BigQuery, GCS, Pub/Sub) |
| View | Restricts who sees which entries inside a bucket |
| Scope | Cross-project log scope for a single Logs Explorer |
6.2 Logs Explorer query language
📝 Code, common Logs Explorer queries
# All warnings/errors from nexus-vm in the last hour
resource.type="gce_instance"
resource.labels.instance_id="123456789"
severity>=WARNING
timestamp>="2026-04-28T18:00:00Z"
# IAM policy changes, last 7 days
logName=~"cloudaudit.googleapis.com%2Factivity"
protoPayload.serviceName="iam.googleapis.com"
protoPayload.methodName=~"SetIamPolicy"
# Failed SSH attempts
resource.type="gce_instance"
jsonPayload.message=~"Failed password for"
6.3 Retention, log-based metrics, alerts
Log-based metrics convert a log filter into a counter or distribution. Useful for "alert me when error rate > 1/min."
📝 Code, create a log-based metric for nexus-vm 5xx
gcloud logging metrics create nexus_5xx \
--description="5xx responses on nexus-vm" \
--log-filter='resource.type="gce_instance" AND jsonPayload.status>=500'
6.4 Cloud Monitoring
Workspaces, dashboards, MQL (Monitoring Query Language) for advanced queries, alerts with conditions and notification channels (email, Slack, PagerDuty, webhook). Default integrations exist for every GCE metric (CPU, disk, network, instance/up).
6.5 Uptime checks and SLOs
Uptime checks ping a public URL from 6 global locations every minute. Failures trigger alerts. Configure in Monitoring → Uptime checks. Add an HTTPS check for https://morpheus.callbrightside.com.
📝 Code, define an uptime check
gcloud monitoring uptime create morpheus-https \
--resource-type=uptime-url \
--resource-labels="host=morpheus.callbrightside.com,project_id=PROJECT_ID" \
--protocol=https --request-method=GET --path=/ --port=443 \
--period=1 --timeout=10  # period in minutes, timeout in seconds
6.6 Cloud Trace, Profiler, Error Reporting
- Cloud Trace, distributed tracing, OpenTelemetry-compatible. Auto-trace HTTP via OT instrumentation.
- Cloud Profiler, sample-based CPU and heap profiler with low overhead. Add the agent to a long-running process and view flamegraphs in the console.
- Error Reporting, groups identical errors across services, dedupes, alerts when a new fingerprint appears.
6.7 Ops Agent vs deprecated agents
The Ops Agent (single binary, Linux + Windows) replaces the legacy Stackdriver Logging Agent and Monitoring Agent. Configures via /etc/google-cloud-ops-agent/config.yaml.
📝 Code, ship Python framework logs from /opt/nexus to Cloud Logging
# /etc/google-cloud-ops-agent/config.yaml
logging:
receivers:
nexus_app:
type: files
include_paths:
- /opt/nexus/logs/*.log
- /opt/nexus/nexus/scripts/output/*.log
record_log_file_path: true
processors:
json_parse:
type: parse_json
service:
pipelines:
nexus_pipeline:
receivers: [nexus_app]
processors: [json_parse]
metrics:
receivers:
hostmetrics:
type: hostmetrics
collection_interval: 60s
service:
pipelines:
default_pipeline:
receivers: [hostmetrics]
# Apply
sudo systemctl restart google-cloud-ops-agent
🔥 Recency, legacy agents EOL
The legacy Logging Agent and Monitoring Agent were deprecated in 2023 and reach end of support October 2024 / Q1 2025. New installs must use Ops Agent. If you find google-fluentd or stackdriver-agent on the VM, that is an upgrade you owe yourself. Source: cloud.google.com/stackdriver/docs/deprecations.
6.8 Audit logs (cross-reference)
See Section 2.9. Admin Activity is always on, free, 400-day retention.
6.9 Common alerting patterns
- VM down, alert on compute.googleapis.com/instance/up = 0 for 3 minutes.
- Disk usage, alert on agent.googleapis.com/disk/percent_used > 85% for 5 minutes.
- Memory pressure, alert on agent.googleapis.com/memory/percent_used > 90% for 5 minutes.
- Auth failures, log-based metric on Failed password, alert when rate > 5/min.
- Cost spike, billing budget at 50%/75%/100% with email + Pub/Sub.
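The "rate > 5/min" pattern is just a sliding-window counter. Cloud Monitoring computes this server-side from the log-based metric; this toy Python version exists only to show the logic.

📝 Code, toy sliding-window rate alert

```python
from collections import deque

class RateAlert:
    """Toy model of the 'auth failures > threshold per window' pattern."""
    def __init__(self, threshold: int, window_s: int = 60):
        self.threshold, self.window_s = threshold, window_s
        self.events: deque[float] = deque()

    def record(self, ts: float) -> bool:
        """Record one event; True means the window now breaches threshold."""
        self.events.append(ts)
        while self.events and self.events[0] <= ts - self.window_s:
            self.events.popleft()  # drop events older than the window
        return len(self.events) > self.threshold

alert = RateAlert(threshold=5)
fired = [alert.record(t) for t in [0, 5, 10, 15, 20, 25]]
print(fired)  # the sixth failure inside 60 seconds trips the alert
```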
6.10 Observability cost
Logging: $0.50/GiB ingest, free egress to bucket, then storage $0.01/GiB-mo after 30 days for default bucket. Monitoring: free for resource metrics, $0.2580/MiB for chargeable metrics. Trace: $0.20 per million spans. Profiler: free.
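A quick calculator for the logging line items, using the rates above. These are list prices and the assumptions are simplified (flat ingest, linear storage beyond 30 days); verify against the current pricing page before budgeting.

📝 Code, rough monthly Cloud Logging bill

```python
def monthly_logging_cost(gib_per_day: float, retention_days: int = 30,
                         ingest_rate: float = 0.50,
                         storage_rate: float = 0.01) -> float:
    """Rough monthly bill: $0.50/GiB ingest, plus $0.01/GiB-month storage
    for retention beyond the default 30 days."""
    monthly_gib = gib_per_day * 30
    ingest = monthly_gib * ingest_rate
    extra_months = max(retention_days - 30, 0) / 30
    storage = monthly_gib * storage_rate * extra_months
    return round(ingest + storage, 2)

print(monthly_logging_cost(2))                      # ingest only, default retention
print(monthly_logging_cost(2, retention_days=365))  # plus long-retention storage
```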
6.11 Shipping /opt/nexus logs to Cloud Logging
Section 6.7's config is the right answer. Two patterns to know:
- Structured JSON logs are auto-parsed when the file extension is .log and the line is valid JSON. The fields become labels.
- Python's standard logging can write directly via google-cloud-logging's handler, no Ops Agent needed for that path.
📝 Code, Python direct logging into Cloud Logging
import logging
from google.cloud import logging as cloud_logging
cloud_logging.Client().setup_logging(log_level=logging.INFO)
logger = logging.getLogger("nexus.runner")
logger.info("job_started", extra={"json_fields": {"job_id": "abc"}})
✅ Production checklist, Observability
- Ops Agent installed and running on nexus-vm
- /opt/nexus/logs ingested into Cloud Logging via Ops Agent files receiver
- Structured JSON used for all application log lines
- Log-based metrics for: error rate, auth failures, slow queries
- Alert policies for: VM down, disk > 85%, mem > 90%, error spike
- Notification channels: Robert email + Slack #ops
- Uptime check on morpheus.callbrightside.com from 6 regions
- Log retention: _Default 30d, custom audit bucket 365d
- Sink to GCS for compliance archive (Coldline, lifecycle)
- Quarterly review of unused dashboards and noisy alerts
🎓 FOR NEW HIRE, Observability mental model
Logs answer "what happened." Metrics answer "is it normal." Traces answer "where did the request spend time." Errors answer "what's broken." Start with logs and metrics; learn traces when you need them. The first place to look during an incident is Logs Explorer, filter to severity>=ERROR and the time window you care about. The Logs Explorer query language is documented at cloud.google.com/logging/docs/view/logging-query-language.
📝 7. APIs, Auth, SDKs MEDIUM PRIORITY
Every Google Cloud product is an HTTP API. Every API call passes through three checkpoints: API enablement (is the service turned on for this project), authentication (who is calling), and authorization (Section 2 IAM). Understanding the layers makes 401/403 errors trivial.
7.1 APIs catalog and enablement
Each API is identified by a service name like compute.googleapis.com, storage.googleapis.com, secretmanager.googleapis.com, aiplatform.googleapis.com. APIs must be explicitly enabled per project before any call works.
📝 Code, manage API enablement
# List enabled APIs
gcloud services list --enabled
# Enable an API
gcloud services enable secretmanager.googleapis.com
# Disable (will fail if resources exist)
gcloud services disable bigtable.googleapis.com
7.2 OAuth 2.0 flows
| Flow | Use |
|---|---|
| Authorization Code | Web apps acting on behalf of a user |
| Authorization Code + PKCE | Native and mobile apps |
| Client Credentials (JWT bearer) | Service accounts |
| Implicit (legacy) | Avoid |
| Device flow | TVs, CLI on a remote box |
7.3 Application Default Credentials (ADC)
The discovery order Google client libraries follow when looking for credentials:
1. GOOGLE_APPLICATION_CREDENTIALS env var pointing at a JSON key file
2. gcloud user credentials in ~/.config/gcloud/application_default_credentials.json
3. Attached service account on a GCE VM, Cloud Run, Cloud Functions, App Engine, GKE Workload Identity
4. External account (WIF) configured via gcloud iam workload-identity-pools create-cred-config
Use gcloud auth application-default login for local development. On nexus-vm, no env var is needed; the attached SA is auto-discovered via the metadata service.
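A simplified sketch of that discovery order, for intuition only. The real lookup lives inside google.auth.default() and handles more cases (external accounts, impersonation); this just shows why the env var always wins.

📝 Code, simplified ADC discovery order

```python
import os

def adc_source() -> str:
    """Mimic the ADC discovery order, simplified for intuition:
    env var, then the gcloud well-known file, then the metadata server."""
    key_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file:
        return f"service account key file: {key_file}"
    well_known = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json")
    if os.path.exists(well_known):
        return "gcloud user credentials"
    # On GCE / Cloud Run / GKE the metadata server answers at this step.
    return "attached service account via metadata server"

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/tmp/key.json"
print(adc_source())  # the env var wins over everything else
```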
7.4 gcloud config and named profiles
📝 Code, multi-account gcloud profiles
gcloud auth login # browser flow
gcloud config configurations create bsp-prod
gcloud config set project bsp-prod
gcloud config set compute/zone us-central1-a
gcloud config set account robert.dove@callbrightside.com
gcloud config configurations list
gcloud config configurations activate bsp-prod
7.5 Cloud Shell vs local development
Cloud Shell is a free Linux VM in your browser pre-loaded with gcloud, kubectl, terraform, docker, python, node. Persists 5 GB of $HOME. Sessions auto-expire after 60 minutes idle. Useful when your local machine doesn't have gcloud, or when you want to test as a different identity without polluting local config.
7.6 Python SDK clients
| Library | Install | For |
|---|---|---|
| google-cloud-storage | pip install google-cloud-storage | GCS |
| google-cloud-secret-manager | pip install google-cloud-secret-manager | Secret Manager |
| google-cloud-compute | pip install google-cloud-compute | Compute Engine API |
| google-cloud-logging | pip install google-cloud-logging | Cloud Logging |
| google-cloud-monitoring | pip install google-cloud-monitoring | Cloud Monitoring |
| google-cloud-pubsub | pip install google-cloud-pubsub | Pub/Sub |
| google-cloud-aiplatform | pip install google-cloud-aiplatform | Vertex AI |
| google-cloud-bigquery | pip install google-cloud-bigquery | BigQuery |
7.7 REST vs gRPC
Google client libraries default to gRPC where supported (faster, streaming, smaller wire). REST is the fallback (firewall friendlier, easier to debug with curl). Most products support both; the Python SDK abstracts the choice. Compute Engine, Cloud SQL Admin, and a handful of older APIs are REST-only.
7.8 API versioning
Versions follow v1, v1beta1, v1alpha1. Beta is supported for production but breaking changes may occur. Alpha is allowlist-only. Pin to a specific version in your client library imports to avoid surprises.
7.9 Quotas and rate limiting
Every API has per-minute and per-day quotas. View at IAM & Admin → Quotas & System Limits. Common surprises: Compute Engine API Persistent disks (GB) regional quota, Cloud Logging Write API requests per minute, Cloud Functions Concurrent function executions. Request increases via the console; turnaround is hours-to-days.
7.10 Backoff and retry patterns
📝 Code, exponential backoff with the Google client library
from google.api_core import retry
from google.cloud import storage

# Retry transient failures: 1s initial wait, doubling to a 30s cap,
# giving up after 300s overall. (timeout supersedes the older,
# deprecated deadline argument.)
custom_retry = retry.Retry(initial=1.0, multiplier=2.0, maximum=30.0, timeout=300.0)

client = storage.Client()
bucket = client.bucket("bsp-backups")
blob = bucket.blob("daily.tar.gz")
blob.upload_from_filename("daily.tar.gz", retry=custom_retry)
⚠️ Gotcha, idempotency and retries
Auto-retry only works safely on idempotent operations (GET, PUT with full payload, DELETE). For non-idempotent POSTs (create instance), wrap in requestId to dedupe. Compute Engine accepts a requestId on most insert operations.
✅ Production checklist, APIs
- Only the APIs you actually use are enabled (audit quarterly)
- nexus-vm runs as a dedicated SA, not the default Compute SA
- ADC discovery used everywhere, no hardcoded key paths in code
- gcloud profiles separate prod from sandbox
- Pinned client library versions in requirements.txt with hashes
- Exponential backoff for any rate-limited API
- Quota dashboards monitored, alerts at 80% utilization
🎓 FOR NEW HIRE, the Python + GCP starter kit
You will live mostly in Python. The "official" Google client libraries follow a consistent shape: Client object → resource methods. Read cloud.google.com/python/docs/reference as your bookmark page. Bash and gcloud are the second language for ops scripting. Go and TypeScript surface in two contexts only: Cloud Functions / Cloud Run if we go serverless (TS or Python or Go), and the Cloud SQL Auth Proxy / Ops Agent (Go internals). You can ship effective work for years on Python + Bash + a little gcloud.
🚀 8. Build, Deploy, IaC MEDIUM PRIORITY
How code gets from a git push to running on production. The BSP nexus-vm stack today is updated by SSH + git pull + systemd restart. The mature future is Cloud Build → Artifact Registry → deploy.
8.1 Cloud Build
Hosted CI. Each build runs in a sandbox using a sequence of Docker steps defined in cloudbuild.yaml. Triggers on git push (GitHub, GitLab, Bitbucket, Cloud Source Repositories) or webhook. Outputs land in Artifact Registry, GCS, or anywhere the build calls.
📝 Code, minimal cloudbuild.yaml for nexus-vm Python sync
# cloudbuild.yaml
steps:
  - name: "python:3.11-slim"
    entrypoint: bash
    args:
      - -c
      - "pip install -r requirements.txt && python -m pytest -q"
  - name: "gcr.io/cloud-builders/gcloud"
    args: ["compute", "ssh", "nexus-vm", "--zone=us-central1-a",
           "--command=cd /opt/nexus && git pull && sudo systemctl restart nexus.service"]
options:
  logging: CLOUD_LOGGING_ONLY
timeout: "600s"
8.2 Artifact Registry
The successor to Container Registry (gcr.io). Holds container images, plus Maven, npm, Python (PyPI-style), Apt, Yum, Go module, generic file repos. Regional or multi-regional. Per-repo IAM. Vulnerability scanning available (Container Analysis API). Source: cloud.google.com/artifact-registry/docs.
🔥 Recency, Container Registry sunset
Container Registry (gcr.io) was deprecated in 2023 and is being shut down. New images go to Artifact Registry. Existing gcr.io/PROJECT/image URLs auto-redirect via the pkg.dev Artifact Registry mirror, but you should migrate explicitly. Run the migration tool: gcloud artifacts docker upgrade migrate.
8.3 Cloud Deploy
Managed delivery pipeline (continuous delivery). Stages a release through a chain of environments (dev → staging → prod) with manual or automatic promotion gates. Native targets are GKE, Cloud Run, and recently GCE MIG. For a single VM, the value is lower; we lean on Cloud Build directly today.
8.4 IaC choices: Terraform, Deployment Manager, gcloud, Config Connector
| Tool | Status | Use |
|---|---|---|
| Terraform | Industry standard, recommended | Multi-cloud, broad community |
| Deployment Manager | Deprecated 2024, EOL | Legacy projects only, migrate |
| gcloud / scripts | Active | One-offs, ad hoc |
| Config Connector | Active | Manage GCP from inside K8s, GitOps |
| Pulumi | Active third-party | If you prefer real code over HCL |
📝 Code, minimal Terraform for nexus-vm
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "bsp-tfstate"
    prefix = "nexus/prod"
  }
}

provider "google" {
  project = "bsp-prod"
  region  = "us-central1"
}

resource "google_compute_instance" "nexus_vm" {
  name         = "nexus-vm"
  machine_type = "n2-standard-2"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 50
    }
  }

  network_interface {
    network    = google_compute_network.bsp_prod_vpc.id
    subnetwork = google_compute_subnetwork.nexus_subnet.id
    access_config { nat_ip = google_compute_address.nexus_static.address }
  }

  service_account {
    email  = google_service_account.nexus_runner.email
    scopes = ["cloud-platform"]
  }

  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  tags = ["web", "ssh-iap"]
}
8.5 Cloud Source Repositories
Google's git hosting. Free for up to 5 users and 50 GB, but closed to new customers since mid-2024. Mirrors GitHub repos for inside-VPC pulls; historically used as a Cloud Build source mirror. BSP source lives on GitHub, so CSR is optional and not worth adopting now.
8.6 GitHub Actions / GitLab CI integration via WIF
See Section 2.4. The keyless path replaces uploading a service account JSON key to GitHub. The official action is google-github-actions/auth@v2.
📝 Code, GitHub Actions step using WIF
# .github/workflows/deploy.yml
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: "projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/providers/gh-provider"
    service_account: "ci-deployer@bsp-prod.iam.gserviceaccount.com"
- run: gcloud compute ssh nexus-vm --zone=us-central1-a --command="cd /opt/nexus && git pull && sudo systemctl restart nexus"
8.7 Terraform state on GCS
Use a dedicated GCS bucket as Terraform's remote backend. Enable object versioning on the bucket; the GCS backend provides state locking natively via a lock file. Restrict the bucket IAM to the CI service account and the humans who need to import state by hand.
8.8 Deployment patterns for nexus-vm
- Today, manual SSH + git pull + systemctl restart. Works for a single operator.
- Step up, Cloud Build trigger on push to main runs tests, then SSHes in to deploy. Adds an audit trail.
- Mature, build a custom image weekly, swap a MIG of size 1, drain the old VM. Adds rollback.
8.9 Rollbacks
Without IaC, rollback is "git revert + redeploy." With snapshots and instance templates, rollback is "revert MIG to template version v(N-1)" which can complete in minutes.
✅ Production checklist, Build & Deploy
- All infra in Terraform with state on GCS, versioning on, locked
- Cloud Build trigger on main, runs tests before deploy
- CI uses Workload Identity Federation, no JSON keys
- Artifact Registry with vulnerability scanning enabled
- No images on legacy gcr.io (run upgrade tool)
- Build logs streamed to Cloud Logging, retained 90 days
- Deploy is reversible inside 5 minutes (snapshot or git revert)
- Rollback drill quarterly
🎓 FOR NEW HIRE, deploy paths
Today: SSH in, git pull in /opt/nexus, sudo systemctl restart nexus.service. Always look at journalctl -u nexus.service -n 100 -f after a deploy. Tomorrow: a GitHub Actions workflow does it for you when a PR merges. Either way, the pattern of "pull + restart + watch logs" is the same.
⚡ 9. Serverless LOWER PRIORITY
Serverless on GCP means "you bring code, Google runs it on demand." Lower priority for our single-VM stack today, but the right answer for many things we currently shoehorn into nexus-vm cron.
9.1 Cloud Run
The flagship. Containers (any language, any base image), HTTP and gRPC, scaling from 0 to N. Pay per 100 ms of CPU + memory while serving. Two flavors: services (long-running HTTP) and jobs (run to completion). Concurrency per container is configurable, 1 to 1000, default 80.
| Feature | Detail |
|---|---|
| Cold start | ~100ms-2s depending on image size |
| Request timeout | up to 60 minutes (services), 24 hours (jobs) |
| Memory | 128 MiB to 32 GiB |
| CPU | 1, 2, 4, 8 vCPU |
| Min instances | 0 default; set > 0 to avoid cold starts at a cost |
| VPC connector | Direct VPC egress (now GA), or Serverless VPC Access connector |
| Auth | Public, IAP-protected, or invoker IAM |
📝 Code, deploy a Python Cloud Run service from source
gcloud run deploy nexus-helper \
--source=. --region=us-central1 \
--allow-unauthenticated --memory=512Mi --cpu=1 \
--service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
--set-secrets=ANTHROPIC_API_KEY=anthropic-key:latest
9.2 Cloud Functions Gen 1 vs Gen 2
Gen 2 is Cloud Run under the hood with a function-shaped interface, rebranded "Cloud Run functions" in 2024. Gen 1 is legacy. New code goes to Gen 2 (longer timeouts, larger instances, concurrency > 1, richer event triggers). Gen 1 is in maintenance mode.
9.3 App Engine Standard vs Flex
- Standard, language sandbox (Python, Node, Go, Java, PHP, Ruby), instant scale, but legacy programming model.
- Flexible, your container on GCE behind a managed LB. Largely superseded by Cloud Run.
For new builds, default to Cloud Run unless you have an existing App Engine app.
9.4 Pub/Sub
Globally available message queue. Decouples producers and consumers. Two delivery modes: push (Pub/Sub POSTs to your endpoint) and pull (your worker fetches). Delivery is at-least-once by default; exactly-once delivery is a separate setting available on pull subscriptions. Ordering keys, message filtering, schema validation, and dead letter topics round out the feature set. Source: cloud.google.com/pubsub/docs.
9.5 Cloud Scheduler, Tasks, Workflows
| Service | Use |
|---|---|
| Cloud Scheduler | Cron-as-a-service. Hits HTTP, Pub/Sub, App Engine. |
| Cloud Tasks | Per-item queue with rate limiting, dispatch retry, delay. |
| Workflows | YAML state machine for multi-step orchestration. |
9.6 Eventarc
Event router that turns audit log events, GCS object writes, Pub/Sub messages, BigQuery jobs, and SaaS webhooks into Cloud Run / GKE invocations. Use for "when an object lands in this bucket, run that handler" without writing glue.
9.7 Cost models
- Cloud Run: $0.0000180/vCPU-s + $0.000002/GiB-s + $0.40/M requests. First 180k vCPU-s and 360k GiB-s/month free.
- Cloud Functions Gen 2: same as Cloud Run.
- Pub/Sub: $40/TiB ingest, $40/TiB delivery. First 10 GiB/month free.
- Cloud Scheduler: 3 jobs free, then $0.10/job-month.
- Workflows: $0.01/1000 internal steps.
✅ Production checklist, Serverless
- Cloud Run services use a non-default SA with least privilege
- Sensitive Cloud Run services require auth (IAM invoker)
- Min instances > 0 only where cold start is unacceptable
- Pub/Sub topics have dead letter policies
- Cloud Scheduler jobs idempotent and have a max retry policy
- Workflows replace shell scripts when there are >3 steps with branching
🎓 FOR NEW HIRE, when to reach for serverless
Anything that runs on a schedule and finishes in under 10 minutes is a great Cloud Scheduler + Cloud Run job candidate. Anything that responds to events (an upload arrived, a Pub/Sub message landed) is Cloud Run or Functions. Anything that runs continuously and holds state is the VM. We default to "put it on the VM" today, but if you find yourself reaching for cron, ask yourself if Cloud Scheduler is nicer.
🏢 10. Project, Org, Billing MEDIUM PRIORITY
Projects are the unit of cost, quota, and IAM. Organizations are the unit of governance. Billing accounts pay the bills. The wiring matters more than people realize.
10.1 Resource hierarchy
Org → Folder (optional, can nest) → Project → Resources. Most BSP work happens in one project (bsp-prod). Recommended additions: bsp-sandbox for safe experiments, bsp-data for BigQuery and analytics with separate billing visibility.
10.2 Organization policies
Constraints applied at Org/Folder/Project level that override IAM. Examples:
- compute.requireOsLogin, force OS Login on every VM
- compute.disableSerialPortAccess, no serial console
- iam.disableServiceAccountKeyCreation, no SA key downloads
- storage.publicAccessPrevention, no allUsers reads
- storage.uniformBucketLevelAccess, force UBLA on new buckets
- compute.vmExternalIpAccess, allowlist external IPs
- iam.allowedPolicyMemberDomains, restrict who can be granted IAM
📝 Code, set an org policy
gcloud resource-manager org-policies set-policy policy.yaml \
--organization=ORG_ID
# policy.yaml
constraint: constraints/iam.disableServiceAccountKeyCreation
booleanPolicy: { enforced: true }
10.3 Custom org policy constraints
2023+ feature. Define your own constraint in CEL targeting any GCP resource field. Example: "every GCS bucket must be in us-central1." Source: cloud.google.com/resource-manager/docs/organization-policy/custom-constraints.
10.4 Quotas
Per-project, per-region, per-API. Soft limits, increase via console request. Common single-VM ones to know:
- Compute Engine: CPUs (regional)
- Compute Engine: Persistent Disk SSD (GB) (regional)
- Compute Engine: In-use IP addresses (regional)
- Cloud Logging: Log entries per second
10.5 Billing accounts and BigQuery exports
Billing accounts pay one or many projects. Separate the production billing account from the sandbox so a runaway dev VM does not blow the prod budget. Enable BigQuery billing export for accurate per-resource cost analysis.
📝 Code, enable BigQuery billing export
# In the Console: Billing → Billing export → BigQuery export
# Creates a dataset like: PROJECT.billing_export
# Sample query: top 10 services by cost last 30 days
SELECT
service.description AS service,
ROUND(SUM(cost), 2) AS cost_usd
FROM `PROJECT.billing_export.gcp_billing_export_v1_BILLING_ID`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service
ORDER BY cost_usd DESC
LIMIT 10;
10.6 Budgets and alerts
Set per-billing-account or per-project budgets with alerts at thresholds (50%, 75%, 90%, 100%, 150% of forecast). Notifications go to email and Pub/Sub. Pub/Sub can trigger automatic remediation (e.g. shut down sandbox VMs).
10.7 Asset Inventory and Recommender
- Cloud Asset Inventory, queryable snapshot of every resource and IAM binding. Export to BigQuery for compliance reporting. Real-time feed via Pub/Sub.
- Recommender, ML-driven suggestions per resource: idle VMs, oversized VMs, idle PDs, IAM role recommendations.
- Active Assist, umbrella term for Recommender + Policy Intelligence + Insight feeds.
10.8 Project metadata: labels and tags
Two distinct concepts:
- Labels, key/value strings for billing breakdown and search. Set per-resource. Examples: env=prod, component=nexus.
- Tags, hierarchical key/value pairs that participate in IAM (deny policies, conditional bindings) and firewall rules. Set at Org or Project level.
✅ Production checklist, Project & Org
- Org policies enforced for: requireOsLogin, disableServiceAccountKeyCreation, publicAccessPrevention, uniformBucketLevelAccess
- Production isolated from sandbox in different projects with different billing accounts
- Budgets set on every billing account with email + Pub/Sub alerts
- BigQuery billing export running
- Asset Inventory feed to a SecOps bucket for compliance
- Quotas reviewed quarterly, requested increases preempt growth
- Standard label taxonomy applied to every resource (env, component, owner)
🎓 FOR NEW HIRE, project anatomy
If you ever feel "I want to try this without risk," ask Robert to point you at the sandbox project. Production isolation is real. The first thing in any console session: check the project picker top-left and confirm you are in the right project. The number of incidents caused by being in the wrong project is non-trivial.
🛡️ 11. Security Operations MEDIUM PRIORITY
Security Operations on GCP is the sum of detection (Security Command Center), guardrails (Org Policy, Binary Auth, VPC-SC), and forensics (Audit Logs, Asset Inventory).
11.1 Security Command Center tiers
| Tier | Cost | Capabilities |
|---|---|---|
| Standard | Free | Findings: Web Security Scanner, Sensitive Action Service, exposed assets |
| Premium | Per-vCPU + per-bucket pricing | + Event Threat Detection, Container Threat Detection, VM Threat Detection, Posture, Compliance reports (CIS, PCI, HIPAA) |
| Enterprise | Higher tier | + Mandiant threat intel, MISP integration, SOC features |
11.2 Threat detection
- Event Threat Detection, scans Cloud Logging for indicators (suspicious IAM, SSH brute force, malware DNS).
- Container Threat Detection, GKE-only, runtime detection.
- VM Threat Detection, agent-less host introspection on GCE VMs (memory scan).
11.3 Vulnerability scanning
Container Analysis API scans Artifact Registry images on push. Web Security Scanner runs against your App Engine / Cloud Run / GCE web apps to find OWASP issues. Free tier of SCC includes both at limited frequency.
11.4 Compliance reports
SCC Premium includes pre-built reports against frameworks: CIS GCP Benchmark v1.3, PCI DSS, HIPAA, NIST 800-53, ISO 27001, SOC 2, FedRAMP Moderate/High. Each report shows compliant vs non-compliant resources.
11.5 Binary Authorization and Container Analysis
Binary Auth gates GKE/Cloud Run/Anthos so that only signed, attested container images run. Pair with Container Analysis to require a "no high CVEs" attestation. Out of scope for single-VM nexus-vm but the pattern to know if we move to containers.
11.6 Cloud DLP (Sensitive Data Protection)
Detect and redact PII in text, images, and BigQuery. InfoTypes include US_SSN, EMAIL_ADDRESS, CREDIT_CARD_NUMBER, custom regex. Use during ingest of customer data into BSP analytics: scan a sample with DLP, decide whether the field is allowed.
11.7 Access Transparency and Approval
- Access Transparency, logs every Google support engineer access to your data with reason. Available on Premium support+.
- Access Approval, requires your explicit approval before a Google engineer can access your data, at the cost of slower support response.
11.8 Control-plane CMEK
Beyond data CMEK (Section 5.5), control-plane CMEK encrypts metadata about your resources (config, IAM bindings) with your key in some products. Niche, but compliance-relevant.
✅ Production checklist, Security Operations
- SCC Standard enabled on the org (free)
- Web Security Scanner runs against morpheus.callbrightside.com weekly
- Image vulnerability scanning on Artifact Registry repos
- Audit logs archived to a write-only bucket with Bucket Lock
- VPC-SC perimeter around production data buckets and Cloud SQL
- Findings triaged inside 7 days for high, 24 hours for critical
- Quarterly tabletop incident response exercise
🎓 FOR NEW HIRE, the security mindset
Default to least privilege. When you write code that needs a permission, grant the smallest predefined role you can find, or build a custom role. Treat every "could this credential leak" with paranoia. The cheapest way to get hacked is a leaked SA key in a public repo, the cheapest defense is Workload Identity Federation. Read the SCC findings tab once a week and learn what the org looks like to a defender.
💰 12. Cost HIGH PRIORITY
A nexus-vm-sized stack on GCP is cheap if you watch it, expensive if you don't. The expensive surprises are predictable; the cheap path is small habits.
12.1 Pricing models
- Pay as you go, default. Per-second compute, per-GB storage, per-GB egress.
- Sustained use discounts, automatic, kick in above 25% of the month for eligible families (N1, N2, C2; E2 has no SUD).
- Committed use discounts, 1- or 3-year, 37%-55% off list, opt-in.
- Spot VMs, 60-91% off list, can be reclaimed.
- Free tier, persistent monthly free amounts (e2-micro 1 in us-east1/us-west1/us-central1, 5 GB GCS, etc.).
12.2 Free tier
| Service | Free per month |
|---|---|
| Compute Engine | 1 e2-micro in us-central1/us-east1/us-west1, 30 GB pd-standard, 1 GB egress to most |
| GCS | 5 GB Standard, 5,000 Class A ops, 50,000 Class B ops, 1 GB egress |
| Cloud Run | 2M requests, 180,000 vCPU-s, 360,000 GiB-s |
| Cloud Functions | 2M invocations, 400k GB-s, 200k GHz-s |
| Cloud Logging | 50 GiB ingest, 30-day retention |
| Cloud Monitoring | All resource metrics + 150 MiB chargeable |
| Cloud Build | 120 build-minutes/day |
| Pub/Sub | 10 GiB |
| Secret Manager | 6 active secrets, 10k access ops |
12.3 Optimization strategies
- Right-size, use Recommender's "Right-size VMs" insight monthly.
- Lifecycle GCS, auto-tier to Nearline at 30 days, Coldline at 90, delete or archive at 365.
- Schedule sandbox shutdowns, Cloud Scheduler stops dev VMs nights and weekends.
- Buy CUDs, when steady state is locked in.
- Tune log ingest, exclude noisy log lines via sink filters before they hit storage.
- Avoid cross-region egress, keep workloads in the same region as their data.
12.4 Billing exports to BigQuery
See Section 10.5. The detailed export includes per-resource cost broken down by SKU.
12.5 Common cost surprises
| Surprise | Why | Mitigation |
|---|---|---|
| Egress charges | Cross-region or to-internet egress is $0.01-$0.12/GB | Keep data in-region, use Cloud CDN, compress responses |
| Cloud NAT data processing | $0.045/GB processed + $0.0045/hr per IP | Use Private Google Access for googleapis.com endpoints |
| Log ingest | $0.50/GiB beyond 50 GiB | Drop noisy log lines via Sink filter exclusions |
| Snapshot accumulation | Snapshots compound until you delete | Lifecycle on snapshot schedule (e.g. keep 7 daily, 4 weekly, 12 monthly) |
| Idle static IP | $0.005/hr while not attached to running VM | Release unused IPs |
| Cloud Logging rehydration | Fetching logs older than retention is expensive | Stream to GCS or BQ before retention cliff |
| Cloud SQL HA when not needed | 2x cost | Disable HA on dev/sandbox |
12.6 nexus-vm specific cost analysis
Rough monthly estimate, assuming n2-standard-2 (2 vCPU, 8 GB RAM) running 24/7 in us-central1, 50 GB pd-balanced, 1 static IP, ~5 GB Cloud Logging, ~50 GB GCS Standard for backups:
| Item | Detail | $/month |
|---|---|---|
| n2-standard-2 vCPU | 2 vCPU x 730 hrs x $0.0317 | ~$46.30 |
| n2-standard-2 RAM | 8 GiB x 730 hrs x $0.00425 | ~$24.83 |
| Sustained use discount | ~10% off N2 (auto) | -$7.10 |
| 50 GB pd-balanced boot disk | 50 x $0.10 | ~$5.00 |
| Static external IP (in use) | 730 hrs x $0.000 | ~$0.00 |
| Egress to internet | ~10 GB x $0.085 (assumes US-to-most) | ~$0.85 |
| Cloud Logging | 5 GB ingest, free under 50 GiB | $0.00 |
| GCS Standard backups | 50 GB x $0.020 | ~$1.00 |
| Snapshot storage | ~30 GB compressed x $0.026 | ~$0.78 |
| Secret Manager | ~10 secrets, 1k ops/mo | ~$0.06 |
| Total estimate | List minus SUD | ~$71.72 |
💡 Insight, where a 1-yr CUD pays back
A 1-year resource-based CUD on 2 vCPU + 8 GB RAM in us-central1 saves ~37% off N2 list. That is roughly $26/mo savings on a $70 base, paying back inside the first month and locking in the rate for 12 months. Caveat: you keep paying for the committed amount even if you delete the VM.
⚠️ Gotcha, egress is the most-asked-about line
If your monthly bill jumps by $50 unexpectedly, look at egress first. A misconfigured backup that pulls 600 GB to a non-Google destination is roughly $50 of egress out of nowhere. Run the BigQuery billing export query grouped by SKU and filter on sku.description LIKE '%egress%'.
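The arithmetic behind that example. The default rate is the US-to-internet figure used in the nexus-vm estimate; actual egress rates vary by destination in the $0.01-$0.12/GB range noted earlier:

```python
def egress_cost_usd(gb, rate_per_gb=0.085):
    """Quick egress sanity check: GB moved times the per-GB rate."""
    return round(gb * rate_per_gb, 2)
```

So the misconfigured 600 GB backup above comes out to about $51, which is exactly the kind of line that should jump out of the SKU-grouped billing query.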
✅ Production checklist, Cost
- Budgets per project with 50/75/100% alerts
- BigQuery billing export running, dashboards built
- 1-year CUD evaluated for steady-state nexus-vm
- GCS lifecycle on every long-lived bucket
- Snapshot retention policy (max 30 daily snapshots)
- Idle static IPs released monthly
- Logging exclusion filters for noisy services (e.g. health check 200s)
- Sandbox auto-stop schedule via Cloud Scheduler
- Quarterly Recommender review
🎓 FOR NEW HIRE, cost discipline in 90 seconds
Before you create anything, ask: how much does this cost per month if I forget to delete it? GCP bills by the second; a forgotten dev VM at $50/month is $1.65/day. The dashboard at Console → Billing → Reports tells you in <30 seconds. Bookmark it. Look at it weekly.
🖥️ 13. Console UI LOWER PRIORITY
The web console at console.cloud.google.com is mostly self-explanatory, but a handful of patterns save real time.
13.1 URL structure
Every page has a deep link. https://console.cloud.google.com/compute/instances?project=PROJECT_ID jumps directly to the instance list for a project. Bookmark these for the resources you visit daily.
13.2 Cloud Shell
Click the >_ icon top-right to open a Linux shell in your browser, no install. Pre-loaded with gcloud, kubectl, terraform, docker, python, node, vim. $HOME persists 5 GB. Sessions expire after 60 minutes idle.
13.3 Activity feed
Console → Home → Activity. A timeline of every Admin Activity audit event in the project. Useful first stop for "who changed what."
13.4 Search bar
Top-bar search auto-completes resource names across services. Search for nexus-vm and you get the GCE instance, related disks, snapshots, and any logs entries that mention it. Faster than navigating menus.
13.5 Dashboard customization
Console → Home → Dashboard. Add/remove tiles. Pin Monitoring dashboards. Useful for an at-a-glance ops view.
13.6 Mobile app
"Cloud Console" app for iOS/Android. Useful for: viewing alerts, restarting a VM in a pinch, checking the bill on a Sunday morning. Do not run major IaC changes from a phone.
🎓 FOR NEW HIRE, console productivity
Three habits: (1) confirm the project picker every time you open a new tab, (2) press / to jump to search, (3) star resources to pin them in the navigation drawer. The console is fine for exploration; for any change that needs an audit trail, prefer gcloud or Terraform so the change is reviewable.
🔍 14. Troubleshooting HIGH PRIORITY
When something is on fire, a runbook beats panic. This section is the runbook for the failure modes you will actually hit on nexus-vm.
14.1 Cloud Debugger deprecation
🔥 Recency, Cloud Debugger removed
Cloud Debugger was sunset in May 2023. Replacement: Cloud Profiler for performance, plus modern OpenTelemetry-based debugging in your IDE. If you find docs referencing Cloud Debugger, ignore them.
14.2 Connectivity Tests and network path analysis
Console → Network Intelligence → Connectivity Tests. Define a source (VM, IP, internet) and destination, run a simulated path. Tells you which firewall rule, route, or peering blocked the traffic. Saves hours of guessing.
📝 Code, run a connectivity test from CLI
gcloud network-management connectivity-tests create nexus-from-cf \
--source-ip-address=104.16.0.1 \
--destination-instance=projects/PROJECT_ID/zones/us-central1-a/instances/nexus-vm \
--destination-port=443 --protocol=TCP
gcloud network-management connectivity-tests describe nexus-from-cf
14.3 Troubleshooter wizards
The console has wizards for: IAM "why can't user X do Y", VPC "why can't VM A reach B", LB "why is health check failing". Run them before guessing.
14.4 Quotas page
IAM & Admin → Quotas & System Limits. When an API call returns 429, look here first. Filter to the service whose quota you suspect.
14.5 Error code reference
| Code | Meaning | First check |
|---|---|---|
| 400 | Bad request | Validate request body, region, zone names |
| 401 | Unauthenticated | ADC discovery, expired token, wrong gcloud config |
| 403 | Permission denied | Missing IAM role; check exact permission string in error |
| 404 | Not found | Resource name typo, wrong project, wrong region |
| 409 | Conflict | Concurrent modification, wait and retry |
| 429 | Too many requests | Quota or rate limit; check Quotas page |
| 500 | Internal error | Retry with backoff; check status.cloud.google.com |
| 503 | Service unavailable | Regional outage; check status page; retry |
14.6 IAM troubleshooter step-by-step
1. Copy the exact permission string from the 403 (e.g. compute.instances.start).
2. Open Console → IAM & Admin → Troubleshoot.
3. Enter the user/SA email and the resource (instance URL).
4. Click "Check access." It returns the inherited bindings and the missing permission.
5. Grant the smallest predefined role (look it up in cloud.google.com/iam/docs/understanding-roles) that contains the permission.
6. Re-run the failing call. If still 403, check Org Policy and Deny policies.
✅ Production checklist, Troubleshooting
- Connectivity Tests scripted for the top 5 failure paths
- Status page (status.cloud.google.com) bookmarked
- Runbook with "first 5 minutes" steps for: VM unreachable, 5xx spike, cost spike, auth failures
- On-call rotation acknowledged and tested for alert delivery
- Monthly tabletop drill on a different scenario each time
🎓 FOR NEW HIRE, the calmness algorithm
(1) Read the error literally. (2) Map it to a section in this doc. (3) Run the troubleshooter / connectivity test before guessing. (4) Never paste your fix into production until you can articulate the failure mode in one sentence. (5) When stuck after 30 minutes, ask Robert. Cost of asking: 0. Cost of cascading the wrong fix: hours.
🎯 15. Integration Points with nexus-vm Stack HIGHEST PRIORITY
The longest section by design. Everything above maps to abstractions; this maps to the actual production stack at 34.55.179.122 and the systems it touches. If a future Robert reads only one section to recover from a disaster, this is the one.
15.1 The current stack, in one screen
| Layer | Component | State today | Where it lives |
|---|---|---|---|
| DNS & Edge | Cloudflare zone a87220882ed631dd4dfb | Production | Cloudflare |
| Compute | GCE VM nexus-vm | Production, single VM | us-central1-a, IP 34.55.179.122 |
| Filesystem | /opt/nexus Python framework | Production | nexus-vm boot disk |
| HTTP service | Context Harness on localhost:8765 | Production | nexus-vm, systemd-managed |
| RAG store | Zeus, 19,679 chunks, text-embedding-3-small | Production | nexus-vm filesystem + index |
| Web UI | morpheus.callbrightside.com | Production | nexus-vm + Cloudflare |
| WP integration | claude-api → bricks.callbrightside.com WP REST | Production | Hostinger u227696829 |
| SSH | ~/.ssh/google_compute_engine + dovew user | Production | nexus-vm metadata |
| Secrets (today) | OS env vars / .env files on VM | To migrate | To Secret Manager |
| Backups (today) | None automated | To create | To GCS + snapshot policy |
15.2 SSH access patterns, where we are vs where we should be
Today. Robert SSHes via ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122. The public key is in instance metadata under ssh-keys. This works but has three weaknesses: (a) the VM has a public IP that any bot can scan, (b) revoking access requires editing metadata, (c) audit logs show dovew not robert.dove@callbrightside.com.
Bulletproof target. No public IP. Access via IAP TCP forwarding (Section 1.9.3) gated by OS Login (Section 1.9.1). Audit logs show the Google identity. Revoking access is one IAM binding removal.
📝 Code, the migration plan
# 1. Grant Robert OS Login + IAP roles
gcloud projects add-iam-policy-binding bsp-prod \
--member=user:robert.dove@callbrightside.com --role=roles/compute.osAdminLogin
gcloud projects add-iam-policy-binding bsp-prod \
--member=user:robert.dove@callbrightside.com --role=roles/iap.tunnelResourceAccessor
# 2. Add the IAP firewall rule
gcloud compute firewall-rules create allow-ssh-iap \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20 --target-tags=ssh-iap
# 3. Test IAP works while public IP is still attached
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
# 4. Enable OS Login per-instance
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a --metadata enable-oslogin=TRUE
# 5. After 7 days of stable IAP-only operation, drop the public IP
gcloud compute instances delete-access-config nexus-vm \
--zone=us-central1-a --access-config-name="External NAT"
⚠️ Gotcha, do not drop the public IP without first wiring up the LB
If the VM goes private, public traffic for morpheus.callbrightside.com cannot reach it directly. You need a global Application Load Balancer with the VM as a backend: the LB holds the public IP, and the VM accepts traffic only from the LB and IAP. Plan the LB before pulling the IP.
15.3 Service accounts for nexus-vm and external integration
Recommended SA design:
nexus-runner@bsp-prod.iam.gserviceaccount.com, attached to the VM. Roles:
- roles/secretmanager.secretAccessor on each app secret (scope to specific secrets, not project-wide)
- roles/storage.objectAdmin on the backup bucket only
- roles/logging.logWriter
- roles/monitoring.metricWriter
- roles/cloudtrace.agent
- roles/errorreporting.writer
cf-dns-bot@bsp-prod.iam.gserviceaccount.com, used by automation that touches Cloudflare. Permissions live in Cloudflare's API tokens; the GCP role only needs roles/secretmanager.secretAccessor on the Cloudflare token secret.
wp-integration@bsp-prod.iam.gserviceaccount.com, identity for code that calls the Hostinger WP REST API. Stores BRICKS_WP_APP_PASSWORD in Secret Manager.
📝 Code, attach a fresh SA to nexus-vm
# Create the SA
gcloud iam service-accounts create nexus-runner --display-name="Nexus VM Runner"
# Grant the secrets it needs
for secret in ANTHROPIC_API_KEY CLOUDFLARE_API_TOKEN BRICKS_WP_APP_PASSWORD VAPI_API_KEY OPENAI_API_KEY; do
gcloud secrets add-iam-policy-binding $secret \
--member=serviceAccount:nexus-runner@bsp-prod.iam.gserviceaccount.com \
--role=roles/secretmanager.secretAccessor
done
# Switch the VM (requires VM stop)
gcloud compute instances stop nexus-vm --zone=us-central1-a
gcloud compute instances set-service-account nexus-vm \
--zone=us-central1-a \
--service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
--scopes=cloud-platform
gcloud compute instances start nexus-vm --zone=us-central1-a
15.4 GCS as the backup destination for /opt/nexus
Pick or create a regional bucket in us-central1: gs://bsp-nexus-backups. Versioning on, lifecycle to Nearline at 30 days, Coldline at 90, delete at 365 (Section 4.2). UBLA on. Restrict IAM to the nexus-runner SA + a humans-only audit role.
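The lifecycle rules above (Nearline at 30 days, Coldline at 90, delete at 365) can be written down as the JSON that gcloud storage buckets update --lifecycle-file expects. A minimal sketch in Python; the bucket name is from this section, and the schema is the standard GCS lifecycle format:

```python
import json

# Lifecycle policy for gs://bsp-nexus-backups: tier down at 30 and 90 days,
# delete at 365. Save as lifecycle.json, then apply with:
#   gcloud storage buckets update gs://bsp-nexus-backups --lifecycle-file=lifecycle.json
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Versioning and UBLA still need their own flags (--versioning, --uniform-bucket-level-access); only the age tiers live in this file.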
📝 Code, daily /opt/nexus backup script
#!/bin/bash
# /opt/nexus/scripts/backup_daily.sh
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
ARCHIVE=/tmp/nexus-${DATE}.tar.zst
tar --zstd -cf $ARCHIVE \
--exclude='/opt/nexus/.git' \
--exclude='/opt/nexus/**/__pycache__' \
--exclude='/opt/nexus/**/*.pyc' \
/opt/nexus
gcloud storage cp $ARCHIVE gs://bsp-nexus-backups/daily/${DATE}.tar.zst
rm $ARCHIVE
# Optional: write a sentinel for the latest successful run
gcloud storage cp /dev/stdin gs://bsp-nexus-backups/_latest.txt <<< "$DATE"
# Cron entry
# 0 3 * * * /opt/nexus/scripts/backup_daily.sh >> /var/log/nexus-backup.log 2>&1
15.5 Cloud SQL evaluation for the WP database
The WordPress staging at bricks.callbrightside.com runs on Hostinger's MySQL. If we ever decide to bring WP on-platform (full GCP), the path is:
- Create a Cloud SQL MySQL 8.0 instance, 2 vCPU, 8 GB RAM, 100 GB SSD, HA enabled, automated backups, PITR enabled, maintenance window Saturday 03:00 UTC.
- Migrate via Database Migration Service (DMS). Set up continuous replication, validate, cutover.
- Update wp-config.php on a GCE-hosted PHP setup or App Engine to point at the Cloud SQL Auth Proxy socket.
- Use Secret Manager for the DB password.
- Take Cloud SQL backups daily, export weekly to GCS for cross-region DR.
Estimated incremental cost: roughly $210/month (~$190 for the HA Cloud SQL instance plus ~$17 storage and ~$5 backups, per Appendix D.3). Decision deferred until WP scale or compliance forces it.
15.6 GCE firewall rules and hardening
The minimum firewall set for nexus-vm in production:
| Name | Direction | Source | Ports | Targets |
|---|---|---|---|---|
| allow-ssh-iap | INGRESS | 35.235.240.0/20 | tcp:22 | tag ssh-iap |
| allow-https-cf | INGRESS | Cloudflare CIDR list | tcp:443 | tag web |
| allow-internal | INGRESS | 10.10.0.0/24 | all | VPC-internal |
| deny-all-ingress | INGRESS | 0.0.0.0/0 | all | (catch-all, priority 65534) |
Additional OS-level hardening: ufw or nftables mirroring the GCP firewall, fail2ban for SSH, automatic unattended upgrades enabled, root SSH disabled, password auth disabled, public key only.
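GCP evaluates ingress rules by priority: the matching rule with the lowest priority number wins, and unmatched traffic hits the implied deny. That is why the allows sit at the default priority 1000 under the 65534 catch-all. A toy evaluator to make the ordering concrete; the rule dicts are simplified stand-ins for the table above (not the API schema), and 104.16.0.0/13 is just one example Cloudflare range:

```python
import ipaddress

# Simplified mirror of the table above. port=None means all ports.
RULES = [
    {"name": "allow-ssh-iap",    "priority": 1000,  "action": "allow",
     "source": "35.235.240.0/20", "port": 22},
    {"name": "allow-https-cf",   "priority": 1000,  "action": "allow",
     "source": "104.16.0.0/13",   "port": 443},
    {"name": "allow-internal",   "priority": 1000,  "action": "allow",
     "source": "10.10.0.0/24",    "port": None},
    {"name": "deny-all-ingress", "priority": 65534, "action": "deny",
     "source": "0.0.0.0/0",       "port": None},
]

def evaluate(src_ip, port):
    """Lowest-numbered matching priority wins; no match falls to implied deny."""
    matches = [r for r in RULES
               if ipaddress.ip_address(src_ip) in ipaddress.ip_network(r["source"])
               and r["port"] in (None, port)]
    if not matches:
        return "deny (implied)"
    winner = min(matches, key=lambda r: r["priority"])
    return f'{winner["action"]} ({winner["name"]})'

print(evaluate("35.235.240.7", 22))   # → allow (allow-ssh-iap), beats deny-all at 65534
print(evaluate("203.0.113.9", 22))    # → deny (deny-all-ingress)
```

If an allow rule ever gets a priority above 65534 it silently stops working; keep allows numerically below the catch-all.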
📝 Code, baseline OS hardening on Debian/Ubuntu
sudo apt update && sudo apt install -y unattended-upgrades fail2ban
sudo dpkg-reconfigure -plow unattended-upgrades
# /etc/ssh/sshd_config tweaks
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
ClientAliveInterval 300
ClientAliveCountMax 2
sudo systemctl reload ssh
15.7 Static IP status and snapshots schedule
📝 Code, verify nexus-vm IP is static, then create a snapshot policy
# Check static IP
gcloud compute addresses list --filter="address=34.55.179.122"
# Promote ephemeral to static if needed
gcloud compute addresses create nexus-vm-static \
--addresses=34.55.179.122 --region=us-central1
# Create a daily snapshot schedule, retained 14 days (max-retention-days below;
# GCE schedules have no native weekly/monthly tiers, use manual snapshots for those)
gcloud compute resource-policies create snapshot-schedule nexus-daily \
--region=us-central1 \
--max-retention-days=14 \
--start-time=07:00 --daily-schedule \
--on-source-disk-delete=keep-auto-snapshots \
--storage-location=us
# Attach to the boot disk
gcloud compute disks add-resource-policies nexus-vm \
--zone=us-central1-a --resource-policies=nexus-daily
15.8 Load balancer + GCE backend for production scale
If we add a global Application Load Balancer for morpheus.callbrightside.com:
- Frontend: HTTPS, managed cert for morpheus.callbrightside.com, Cloud Armor policy attached.
- Backend: instance group of size 1 containing nexus-vm. Health check on :8765/healthz (the Context Harness exposes this).
- URL map: a single default backend service today, can route paths later.
- Cloudflare in front of the LB or removed; pick one CDN.
Benefit: TLS terminates on Google, can drop nexus-vm public IP, auto-scale to a MIG of 2 when needed without architecture rework.
15.9 Backup strategies, layered
| Layer | Frequency | RPO | RTO | Method |
|---|---|---|---|---|
| Boot disk snapshots | Daily | 24h | 15-30 min | Snapshot schedule (Section 15.7) |
| /opt/nexus tarball | Daily | 24h | 5-10 min | Cron + GCS (Section 15.4) |
| Git remote | On every push | Minutes | 1-2 min | GitHub origin |
| Secrets | On rotation | ~immediate | 1 min | Secret Manager versions |
| External APIs | N/A | N/A | N/A | Service-side responsibility (Hostinger, Cloudflare) |
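The RPO column above can be checked mechanically: worst-case data loss per layer is now minus that layer's last successful run. A minimal sketch with hypothetical last-success timestamps; the 15-minute git target is an assumption standing in for "Minutes" in the table:

```python
from datetime import datetime, timedelta

# Target RPO per backup layer, from the table above (git target assumed)
RPO = {
    "boot-disk-snapshot": timedelta(hours=24),
    "nexus-tarball":      timedelta(hours=24),
    "git-remote":         timedelta(minutes=15),
}

def rpo_breaches(last_success, now):
    """Return layers whose last success is older than their RPO target."""
    return sorted(layer for layer, target in RPO.items()
                  if now - last_success[layer] > target)

# Hypothetical last-success times
now = datetime(2025, 3, 2, 12, 0)
last = {
    "boot-disk-snapshot": datetime(2025, 3, 2, 7, 0),   # this morning, OK
    "nexus-tarball":      datetime(2025, 2, 28, 3, 0),  # two days old, breach
    "git-remote":         datetime(2025, 3, 2, 11, 55), # five minutes ago, OK
}
print(rpo_breaches(last, now))  # → ['nexus-tarball']
```

Feed real timestamps from snapshot listings, the GCS sentinel, and git log; a non-empty result is an alertable event.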
15.10 Monitoring nexus-vm via Ops Agent
See Section 6.7 for the Ops Agent install. The minimum alerts to wire up:
- Instance up = 0 for 3 minutes
- Disk usage > 85% for 5 minutes
- Memory usage > 90% for 5 minutes
- Context Harness :8765 uptime check fails for 2 minutes
- morpheus.callbrightside.com uptime check fails for 2 minutes
- Error rate from /opt/nexus logs > 5/min for 5 minutes
- Daily snapshot did not complete (custom log-based metric on snapshot job)
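The "X for N minutes" pattern above is a threshold plus a duration window: the condition fires only when every sample in the window breaches. A minimal sketch of that logic, assuming one sample per minute (Cloud Monitoring does this server-side; this is just the shape of the condition):

```python
def breaching_for(samples, threshold, minutes):
    """True if the last `minutes` samples (one per minute) all exceed threshold,
    mirroring a Cloud Monitoring threshold-plus-duration alert condition."""
    window = samples[-minutes:]
    return len(window) == minutes and all(v > threshold for v in window)

# Disk usage percent, one sample per minute; alert is >85% for 5 minutes
disk = [80, 82, 84, 86, 87, 88, 90, 91]
print(breaching_for(disk, 85, 5))  # → True, last five samples all above 85
print(breaching_for(disk, 85, 6))  # → False, the 84 falls inside a 6-minute window
```

The duration is what keeps a single noisy sample from paging anyone; tune it per alert, not globally.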
15.11 Secret Manager rotation for app secrets
Critical secrets to rotate on schedule:
- BRICKS_WP_APP_PASSWORD, application password in WordPress for the claude-api integration. Rotate every 90 days.
- CLOUDFLARE_API_TOKEN, scoped to zone a87220882ed631dd4dfb. Rotate every 90 days.
- ANTHROPIC_API_KEY, the Anthropic API key powering Daniel AI / Nexus calls. Rotate every 90 days.
- VAPI_API_KEY, Vapi (Daniel AI on (913) 963-9817, assistant e2920d04). Rotate every 90 days.
- OPENAI_API_KEY (text-embedding-3-small for Zeus). Rotate every 90 days.
Rotation pattern: add new version, update consumer code to read latest, monitor for ~24 hours, disable old version (do not destroy yet), verify, destroy after 7 days.
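The 90-day cadence is easy to let slip without a mechanical check. A minimal sketch of a due-for-rotation report; the last-rotation dates below are hypothetical, in practice derive them from version create times (gcloud secrets versions list NAME):

```python
from datetime import date, timedelta

ROTATE_EVERY = timedelta(days=90)

# Hypothetical last-rotation dates per secret
last_rotated = {
    "BRICKS_WP_APP_PASSWORD": date(2025, 1, 10),
    "CLOUDFLARE_API_TOKEN":   date(2024, 11, 1),
    "ANTHROPIC_API_KEY":      date(2025, 2, 20),
}

def due_for_rotation(last, today):
    """Secrets whose newest version is 90 or more days old."""
    return sorted(name for name, d in last.items()
                  if today - d >= ROTATE_EVERY)

print(due_for_rotation(last_rotated, date(2025, 3, 1)))  # → ['CLOUDFLARE_API_TOKEN']
```

Run it from cron on nexus-vm and log the result; an overdue secret is a checklist failure, not an emergency.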
15.12 Disaster recovery, nexus-vm dies, how to rebuild
Assume the VM is gone (deleted, host failure, regional outage). Recovery procedure, ordered:
- Confirm in the console that the instance is in fact gone, not just stopped (gcloud compute instances list --filter=name=nexus-vm).
- If the instance was deleted but the boot disk was retained (deleted with --keep-disks=boot; not the default), recreate the instance from the existing disk: gcloud compute instances create nexus-vm --zone=us-central1-a --disk=name=nexus-vm,boot=yes with the prior SA, tags, and network.
- If the boot disk is gone, restore from the latest snapshot: gcloud compute disks create nexus-vm --source-snapshot=nexus-daily-LATEST --zone=us-central1-a, then create an instance pointing at it.
- Reattach the static IP 34.55.179.122 via --address=nexus-vm-static.
- Validate that /opt/nexus is intact, run systemctl status nexus.service context-harness.service.
- If the zone is down, restore in a different zone of us-central1; if all of us-central1 is down, the snapshot is multi-regional so you can build in us-east1 (different external IP, update Cloudflare DNS).
- Smoke test: curl https://morpheus.callbrightside.com, run a Zeus search, check Context Harness /healthz.
- Rotate the Anthropic, Cloudflare, and BRICKS_WP_APP_PASSWORD secrets just in case the disaster was a credential compromise.
📝 Code, the fast rebuild script (run from any machine with gcloud)
#!/bin/bash
set -euo pipefail
PROJECT=bsp-prod
ZONE=us-central1-a
VM=nexus-vm
SA=nexus-runner@${PROJECT}.iam.gserviceaccount.com
# 1. Find latest snapshot
LATEST=$(gcloud compute snapshots list \
--filter="name~nexus-daily AND status=READY" \
--sort-by=~creationTimestamp --limit=1 --format="value(name)")
echo "Restoring from snapshot: $LATEST"
# 2. Recreate boot disk
gcloud compute disks create $VM \
--source-snapshot=$LATEST --zone=$ZONE --type=pd-balanced
# 3. Create instance from existing disk
gcloud compute instances create $VM \
--zone=$ZONE --machine-type=n2-standard-2 \
--disk=name=${VM},boot=yes,auto-delete=yes \
--service-account=$SA --scopes=cloud-platform \
--address=nexus-vm-static \
--tags=web,ssh-iap \
--shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring
# 4. Wait for boot, smoke test
sleep 60
gcloud compute ssh $VM --zone=$ZONE --tunnel-through-iap --command="systemctl status nexus.service"
15.13 Complete nexus-vm production architecture
✅ Production checklist, nexus-vm integration
- OS Login enabled, IAP TCP forwarding tested, plan to drop public IP
- Dedicated nexus-runner SA attached, default Compute SA detached
- Daily snapshot schedule on boot disk (14d retention)
- Daily /opt/nexus tarball to GCS regional bucket with lifecycle
- All app secrets in Secret Manager with 90d rotation cadence
- Ops Agent installed, /opt/nexus/logs ingested into Cloud Logging
- Uptime checks for morpheus.callbrightside.com and Context Harness :8765
- Static IP 34.55.179.122 promoted/named, not ephemeral
- Firewall rules: allow-ssh-iap, allow-https-cf, allow-internal, deny-all 65534
- OS hardening: unattended-upgrades, fail2ban, root login off, password auth off
- DR script tested quarterly, last test logged in MH
- Monthly cost review against the $72 baseline
🎓 FOR NEW HIRE, the nexus-vm onboarding lap
Day 1: SSH to nexus-vm, cd /opt/nexus, run ls, git status, systemctl status nexus.service context-harness.service. Day 2: open morpheus.callbrightside.com and click around, run a Zeus search via the harness. Day 3: read this section end-to-end. Day 5: shadow Robert through the daily ops loop. Week 2: own the daily backup verification (does the GCS bucket have today's tarball). Week 3: own a non-critical change, write a Master History entry. Month 2: lead a DR drill end-to-end with Robert observing.
Appendices
Appendix A. gcloud CLI cheatsheet for single-VM ops
| Action | Command |
|---|---|
| Configure account/project | gcloud auth login · gcloud config set project bsp-prod · gcloud config set compute/zone us-central1-a |
| List VMs | gcloud compute instances list |
| Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a |
| SSH | gcloud compute ssh nexus-vm --zone=us-central1-a [--tunnel-through-iap] |
| Stop / start | gcloud compute instances stop nexus-vm · gcloud compute instances start nexus-vm |
| Resize | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 (stopped VM) |
| Resize disk | gcloud compute disks resize nexus-vm --size=100GB then resize2fs |
| Snapshot | gcloud compute disks snapshot nexus-vm --snapshot-names=manual-$(date +%Y%m%d) |
| List snapshots | gcloud compute snapshots list --filter="name~nexus" |
| List firewall rules | gcloud compute firewall-rules list |
| Add firewall rule | gcloud compute firewall-rules create NAME --allow=tcp:443 --source-ranges=... |
| List addresses | gcloud compute addresses list |
| Reserve static IP | gcloud compute addresses create NAME --addresses=IP --region=us-central1 |
| Read serial port | gcloud compute instances get-serial-port-output nexus-vm |
| List service accounts | gcloud iam service-accounts list |
| Get IAM policy on project | gcloud projects get-iam-policy bsp-prod |
| Add IAM binding | gcloud projects add-iam-policy-binding bsp-prod --member=... --role=... |
| List secrets | gcloud secrets list |
| Read latest secret | gcloud secrets versions access latest --secret=NAME |
| Add secret version | echo -n "VAL" | gcloud secrets versions add NAME --data-file=- |
| Tail logs | gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc |
| Stream logs | gcloud alpha logging tail 'resource.type="gce_instance"' |
| List buckets | gcloud storage buckets list |
| Copy to GCS | gcloud storage cp file.tar.gz gs://bsp-nexus-backups/ |
| Download from GCS | gcloud storage cp gs://bsp-nexus-backups/latest.tar.gz . |
| List enabled APIs | gcloud services list --enabled |
| Run a connectivity test | gcloud network-management connectivity-tests create ... |
| Show quotas | gcloud compute regions describe us-central1 --format='value(quotas)' |
| Billing info | gcloud billing projects describe bsp-prod |
Appendix B. IAM roles → permissions matrix (single-VM relevant)
| Role | Key permissions | Use |
|---|---|---|
| roles/compute.osLogin | compute.instances.osLogin | SSH as a regular user via OS Login |
| roles/compute.osAdminLogin | compute.instances.osAdminLogin | SSH as sudo via OS Login |
| roles/iap.tunnelResourceAccessor | iap.tunnelInstances.accessViaIAP | SSH through IAP tunnel |
| roles/compute.instanceAdmin.v1 | compute.instances.* (start, stop, delete, set-machine-type) | Manage VM lifecycle |
| roles/compute.storageAdmin | compute.disks.*, compute.snapshots.* | Disks and snapshots |
| roles/compute.networkAdmin | compute.networks.*, compute.firewalls.*, compute.routers.* | VPC and firewalls |
| roles/storage.objectViewer | storage.objects.get, list | Read GCS objects |
| roles/storage.objectAdmin | storage.objects.* | Read/write GCS objects (bucket-scope) |
| roles/storage.admin | storage.* (incl. buckets) | Bucket admin, dangerous in prod |
| roles/secretmanager.secretAccessor | secretmanager.versions.access | Read latest/specific version |
| roles/secretmanager.secretVersionManager | secretmanager.versions.add, disable | Rotate secrets |
| roles/secretmanager.admin | secretmanager.* | Create/delete secrets |
| roles/cloudkms.cryptoKeyEncrypterDecrypter | cloudkms.cryptoKeyVersions.useToEncrypt/Decrypt | Use a key |
| roles/logging.logWriter | logging.logEntries.create | Write log entries |
| roles/logging.viewer | logging.logEntries.list | Read logs |
| roles/monitoring.metricWriter | monitoring.timeSeries.create | Write custom metrics |
| roles/monitoring.viewer | monitoring.* read | View dashboards |
| roles/monitoring.editor | monitoring.* write | Edit dashboards, alerts |
| roles/cloudtrace.agent | cloudtrace.traces.patch | Send traces |
| roles/errorreporting.writer | errorreporting.errorEvents.create | Send error events |
| roles/cloudsql.client | cloudsql.instances.connect | Connect through Cloud SQL Auth Proxy |
| roles/cloudbuild.builds.editor | cloudbuild.builds.* | Run Cloud Build |
| roles/iam.serviceAccountTokenCreator | iam.serviceAccounts.signBlob, getAccessToken | Sign on behalf of an SA |
| roles/iam.workloadIdentityUser | iam.serviceAccounts.getOpenIdToken | WIF target binding |
| roles/run.invoker | run.routes.invoke | Call a private Cloud Run service |
| roles/owner | everything | Avoid in production |
| roles/editor | almost everything except IAM | Avoid in production |
| roles/viewer | read most resources | OK for read-only humans |
Appendix C. Troubleshooting decision trees
C.1 VM unreachable via SSH
C.2 API returning 403
C.3 Costs spiking
C.4 Logs not appearing
Appendix D. Cost calculator examples (single-VM scenarios)
D.1 Baseline nexus-vm, n2-standard-2, 50 GB pd-balanced, 24/7
| Component | Quantity | Unit price | Monthly |
|---|---|---|---|
| n2 vCPU (us-central1) | 2 x 730 hr | $0.0317/hr | $46.30 |
| n2 RAM (us-central1) | 8 GiB x 730 hr | $0.00425/hr | $24.83 |
| SUD ~10% (auto) | applied | - | -$7.10 |
| pd-balanced | 50 GB | $0.10/GB-mo | $5.00 |
| Static IP (in-use) | 730 hr | free while in-use | $0.00 |
| Egress (light) | 10 GB | $0.085/GB | $0.85 |
| GCS backup (Standard) | 50 GB | $0.020/GB-mo | $1.00 |
| Snapshot storage | ~30 GB | $0.026/GB-mo | $0.78 |
| Logging | 5 GiB ingest | free under 50 GiB | $0.00 |
| Secret Manager | 10 active, 1k ops | ~$0.06/secret-mo | $0.60 |
| Subtotal | | | $72.26 |
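The subtotal can be sanity-checked by summing the rounded line items above; a quick arithmetic check in Python:

```python
# Monthly line items from table D.1, rounded as shown there
line_items = {
    "n2 vCPU":        46.30,
    "n2 RAM":         24.83,
    "SUD ~10%":       -7.10,
    "pd-balanced":     5.00,
    "static IP":       0.00,
    "egress":          0.85,
    "GCS backup":      1.00,
    "snapshots":       0.78,
    "logging":         0.00,
    "secret manager":  0.60,
}
total = round(sum(line_items.values()), 2)
print(total)  # → 72.26
```

Rerun this whenever a line item changes; the $72 baseline in the checklist tracks this number.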
D.2 Baseline + 1-yr CUD on N2 (2 vCPU + 8 GiB)
| Component | Effect | Delta |
|---|---|---|
| 1-yr CUD on n2 vCPU + RAM | ~37% off | -$22.70 |
| SUD does not stack | replace SUD | +$7.10 |
| Adjusted total | | ~$56.66/mo |
D.3 Baseline + Cloud SQL for WP (HA, db-custom-2-8192, 100 GB)
| Component | Monthly add |
|---|---|
| Cloud SQL HA, 2 vCPU + 8 GiB | ~$190 |
| Storage 100 GB SSD | ~$17 |
| Backups (auto) | ~$5 |
| Adjusted total | ~$284/mo |
D.4 Baseline + Global Application Load Balancer
| Component | Monthly add |
|---|---|
| Forwarding rule (1) | ~$18 (+ data processing) |
| Cloud Armor base + WAF rules | ~$5/policy + per-request |
| Egress through LB | $0.012/GB additional + standard egress |
| Adjusted total | ~$95-110/mo |
Appendix E. Glossary
- ADC
- Application Default Credentials, the discovery order Google client libraries follow.
- API
- Application Programming Interface, here the HTTP/gRPC service endpoint Google exposes.
- Artifact Registry
- Google's package and container image repository, successor to Container Registry.
- BQ
- BigQuery, Google's serverless data warehouse.
- CDN
- Content Delivery Network, here Cloud CDN or Cloudflare.
- CEL
- Common Expression Language, used in IAM conditions and org policy custom constraints.
- CIDR
- Classless Inter-Domain Routing, an IP range like 10.10.0.0/24.
- CMEK
- Customer-Managed Encryption Key, key in your Cloud KMS used to encrypt a resource.
- Cloud Armor
- Google's WAF + DDoS protection for the Application Load Balancer.
- Cloud Build
- Hosted CI service.
- Cloud Run
- Serverless container service, scales 0 to N.
- Cloud Shell
- Browser-based Linux shell pre-loaded with gcloud.
- CSEK
- Customer-Supplied Encryption Key, raw key bytes per request, mostly deprecated.
- CUD
- Committed Use Discount, 1- or 3-year commitment for compute pricing.
- DLP
- Data Loss Prevention, now Sensitive Data Protection.
- DR
- Disaster Recovery, the practice of rebuilding after major failure.
- Eventarc
- Event router that bridges audit logs and Pub/Sub into Cloud Run.
- GA
- Generally Available, the highest stability level for a Google product.
- GCE
- Google Compute Engine, the IaaS VM service.
- GCS
- Google Cloud Storage, the object store.
- GKE
- Google Kubernetes Engine, managed K8s.
- HA
- High Availability, here a regional Cloud SQL configuration with synchronous standby.
- HCL
- HashiCorp Configuration Language, the syntax of Terraform.
- HSM
- Hardware Security Module, dedicated cryptographic hardware.
- IAM
- Identity and Access Management.
- IAP
- Identity-Aware Proxy, fronts VMs and apps with Google identity auth.
- IaC
- Infrastructure as Code, e.g. Terraform.
- KMS
- Key Management Service.
- LB
- Load Balancer.
- MIG
- Managed Instance Group, an autoscaled cluster of identical VMs.
- MQL
- Monitoring Query Language, advanced query syntax for Cloud Monitoring.
- NCC
- Network Connectivity Center, hub-and-spoke management for VPC and hybrid.
- NIC
- Network Interface Controller. Also Network Intelligence Center.
- OS Login
- SSH access tied to Google identity, IAM-controlled.
- OWASP
- Open Worldwide Application Security Project, source of common rule sets.
- PD
- Persistent Disk, the older block storage family. Hyperdisk is the new family.
- PGA
- Private Google Access, lets a private VM reach googleapis.com via Google's backbone.
- PITR
- Point-In-Time Recovery, restore to any second within retention window.
- PSC
- Private Service Connect, attaches a managed service at a private IP inside your VPC.
- RAG
- Retrieval-Augmented Generation, here the Zeus index of 19,679 chunks.
- RPO
- Recovery Point Objective, the maximum data loss tolerated.
- RTO
- Recovery Time Objective, the maximum downtime tolerated.
- SA
- Service Account, a Google identity for software.
- SCC
- Security Command Center, GCP's posture and findings dashboard.
- SLI
- Service Level Indicator, the metric.
- SLO
- Service Level Objective, the target.
- SSO
- Single Sign-On.
- SSRF
- Server-Side Request Forgery, where a server is tricked into fetching attacker-chosen URLs.
- SUD
- Sustained Use Discount, automatic discount for monthly compute usage.
- TF
- Terraform.
- UBLA
- Uniform Bucket-Level Access, IAM-only access control on a GCS bucket.
- VPC
- Virtual Private Cloud, the global software-defined network.
- VPC-SC
- VPC Service Controls, a security perimeter around managed services.
- WAF
- Web Application Firewall.
- WIF
- Workload Identity Federation, keyless auth from outside-GCP workloads.
Appendix F. Quick reference card
Project: bsp-prod · Region: us-central1 · Zone: us-central1-a · VM: nexus-vm · IP: 34.55.179.122
SSH today: ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122 · SSH bulletproof: gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
Daily ops loop: systemctl status nexus.service context-harness.service · journalctl -u nexus.service -n 100 · df -h · free -m
Backup verification: gcloud storage ls gs://bsp-nexus-backups/daily/ | tail -3 · gcloud compute snapshots list --filter="name~nexus-daily" --sort-by=~creationTimestamp --limit=3
Read a secret: gcloud secrets versions access latest --secret=NAME
Tail logs: gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc --format="value(timestamp,severity,jsonPayload.message)"
Cost dashboard: Console → Billing → Reports, group by SKU · Status: status.cloud.google.com
Incident first 5 minutes: (1) confirm symptom, (2) check status page, (3) gcloud compute instances describe nexus-vm, (4) Logs Explorer severity>=ERROR, (5) Connectivity Test from Cloudflare CIDR.
DR command: see Section 15.12 fast rebuild script.
Bulletproof rule: never the fast option, always best practice. Read first, build second. Receipts not narration.