GCP Architecture Reference
A field reference for operating the Bright Side Plumbing nexus-vm production stack on Google Cloud, plus an onboarding map for a new junior web developer joining the team. The doc treats Google Cloud as the operating environment, calls out every place the BSP stack actually touches GCP, and flags the gotchas that bite single-VM workloads. No em dashes anywhere, dark mode by default, and every section closes with a production checklist.
Table of contents
🎓 FOR NEW HIRE, How to read this doc
Welcome. The fastest path to productivity is: skim sections 1, 2, 3, 6, and 15. Sections 1 and 15 cover the actual VM you will SSH into. Section 2 explains how Google decides whether your account or service account is allowed to do something. Section 3 explains how traffic from a browser actually reaches the VM. Section 6 is how we know anything is broken. Section 15 ties it together for our specific stack. Everything else is reference, dive in when you need it. Cloud work here is mostly Python, with shell glue and the gcloud CLI; Go is the language Google itself uses to build the platform; TypeScript shows up at the edges (Cloudflare Workers, Next.js, Bricks builder). Lean Python first.
🏗️ 1. Compute Engine HIGH PRIORITY
Compute Engine (GCE) is Google Cloud's IaaS layer. Our entire Nexus operational stack runs on a single GCE VM named nexus-vm at external IP 34.55.179.122. Everything in this section is calibrated for single-VM operations. Multi-VM, MIG, and regional patterns are summarized so you can recognize them, not deeply rehearsed.
1.1 The mental model: VM lifecycle and where state lives
A GCE VM is the composition of three independent objects: an instance (CPU/RAM/network attachment), one or more persistent disks (block storage that survives the instance), and a project + zone binding that scopes everything else (firewall rules, IAM, billing). When you "stop" a VM you keep the disks, lose the running RAM, and stop paying for vCPU/RAM but keep paying for disks and reserved static IPs. When you "delete" the instance you can choose to keep or delete each attached disk. Snapshots are the durable backup unit, they live in GCS-backed regional or multi-regional storage and are independent of the disk.
States in Compute Engine: PROVISIONING → STAGING → RUNNING → STOPPING → TERMINATED. There is also SUSPENDING/SUSPENDED for the suspend-to-disk flow which preserves RAM contents on a separate disk. Source: cloud.google.com/compute/docs/instances/instance-life-cycle.
⚠️ Gotcha, "stopped" still costs money
A stopped VM costs $0 for compute but you still pay for: attached persistent disks, attached GPUs that are reserved, reserved static external IPs (a static IP unattached to a running VM costs ~$0.005/hr, around $3.65/mo per address), and any committed-use discounts you bought. The "I'll stop the VM over the weekend to save money" play only works if you also release unused static IPs and right-size the disks.
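The weekend-stop math is easy to sanity-check in a few lines. A minimal sketch, where the disk rate is an illustrative assumption (check the pricing page for your region) and the IP rate is the ~$0.005/hr figure above:

```python
# Sketch: what a "stopped" VM still costs per month.
# The disk rate below is an ASSUMED illustrative figure, not a current list price.
HOURS_PER_MONTH = 730

def stopped_vm_monthly_cost(disk_gb: float,
                            disk_rate_per_gb_month: float = 0.10,  # assumed pd-balanced-ish rate
                            static_ips: int = 1,
                            ip_rate_per_hour: float = 0.005) -> float:
    """Charges that survive a stop: attached disks + reserved static IPs."""
    disk_cost = disk_gb * disk_rate_per_gb_month
    ip_cost = static_ips * ip_rate_per_hour * HOURS_PER_MONTH
    return round(disk_cost + ip_cost, 2)

# A 100 GB disk plus one idle static IP is still a real monthly line item.
print(stopped_vm_monthly_cost(100))  # 13.65
```

The point of the sketch: the static IP alone is ~$3.65/mo, so a "stopped for the weekend" VM with an unreleased IP and an oversized disk quietly keeps billing.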
⚠️ Gotcha, regional vs zonal scoping
VMs and persistent disks are zonal. If us-central1-a goes down, your VM and its zonal PDs are inaccessible. Snapshots and images are stored regionally or multi-regionally, independent of any zone. Static external IPs are regional. Plan with this in mind: a snapshot can rebuild your VM in any zone, but a zonal disk cannot be attached across zones, you must clone via snapshot. Source: cloud.google.com/compute/docs/regions-zones.
⚠️ Gotcha, instance metadata survives stop/start, not delete
Custom instance metadata (startup-script, ssh-keys, user data) lives on the instance object, not the disk. If you delete and recreate the instance reusing the same disk, the metadata is gone. Capture metadata before destructive operations: gcloud compute instances describe nexus-vm --zone us-central1-a --format='value(metadata)'.
1.2 Machine types: families, sizing, and what to pick
Machine types are grouped into families by workload pattern. The family determines the CPU platform, memory ratio, network bandwidth, and pricing curve.
| Family | Series | Workload fit | vCPU range | Mem/vCPU (GB) | Notes |
|---|---|---|---|---|---|
| General purpose | E2 | Cheap, web/dev, low constant load | 2-32 | 0.5-8 | Shared-core (e2-micro/small/medium), CPU platform abstracted |
| General purpose | N2, N2D | Balanced, most production workloads | 2-128 | 0.5-8 | N2 = Intel, N2D = AMD EPYC |
| General purpose | N4 | 2024 GA, Granite Rapids, Hyperdisk-only | 2-80 | 2-4 | Replaces N2 for new builds, but Hyperdisk only |
| General purpose | C3, C3D | Consistent high throughput | 4-176 | 2-8 | Sapphire/Genoa, Titanium NIC, Hyperdisk |
| General purpose | C4, C4A | 2024-25 GA, Emerald Rapids / Axion (Arm) | 2-192 | 2-8 | C4A is Google Axion Arm CPU |
| Compute optimized | C2, C2D, H3 | HPC, gaming servers, single-thread heavy | 4-360 | 2-8 | H3 is HPC-tuned, no live migration |
| Memory optimized | M1, M2, M3, X4 | SAP HANA, in-memory DBs | 40-1920 | 14-30 | X4 = bare metal up to 32 TB RAM |
| Storage optimized | Z3 | Local NVMe-heavy, OLAP, search | 88-176 | 8 | Up to 36 TB Local SSD |
| Accelerator optimized | A2, A3, G2 | GPU/ML, video transcoding | 12-208 | varies | A2=A100, A3=H100/H200, G2=L4 |
For single-VM operations like nexus-vm, the practical universe is E2, N2/N2D, C3. E2 if you want maximum cost efficiency and your workload is bursty. N2 if you want predictable performance with broad disk type support. C3 if you need consistent high throughput, but be aware C3 forces you onto Hyperdisk Balanced, which has different pricing than PD-Balanced.
📝 Code, list machine types in our zone
gcloud compute machine-types list \
--filter="zone:us-central1-a AND name~'^(e2|n2)-'" \
--format="table(name,guestCpus,memoryMb,maximumPersistentDisksSizeGb)" \
--sort-by="guestCpus,memoryMb"
⚠️ Gotcha, E2 shared-core is fine until it isn't
e2-micro / e2-small / e2-medium share a physical core with other tenants and use a CPU credit bucket. If your workload sustains above the burst baseline (roughly 25% of one vCPU for e2-micro, 50% for e2-small, 100% for e2-medium), it gets throttled. For nexus-vm running Python automation that occasionally spikes to do bulk embeddings, an E2 shared-core is the wrong move. Use e2-standard-2 at minimum, or N2 for predictable scheduling.
1.3 Custom machine types and sustained/committed use discounts
N1, N2, N2D, and E2 families allow custom CPU/RAM ratios. You pay for vCPU and RAM independently. Useful when your workload wants a shape no preset offers, say 4 vCPU and 24 GB, which sits between n2-standard-4 (16 GB) and n2-highmem-4 (32 GB). The custom path lets you land on exactly the right shape without overprovisioning.
Two automatic discount programs apply with no opt-in needed:
- Sustained use discount (SUD), applied automatically each month, lowers the price per vCPU/RAM as the instance runs for more of the month. Up to 30% off list for N1; varies for N2 (different curve, applied as inferred discount). Source: cloud.google.com/compute/docs/sustained-use-discounts.
- Committed use discount (CUD), opt-in, you commit 1 or 3 years to a region for a vCPU/RAM amount. 37% off for 1 year, ~55% off for 3 year (resource-based CUDs). Spend-based CUDs are also available for some products. CUDs apply across instances in the region, not tied to a single VM. Source: cloud.google.com/docs/cuds.
💡 Insight, CUDs for a single VM are still worth it
Even for a single nexus-vm, a 1-year resource-based CUD on the exact vCPU/RAM count typically pays back inside ~7 months. The risk is being locked into a region and having to keep paying if you move clouds. Numbers in Section 12 (Cost).
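The ~7-month claim follows directly from the discount arithmetic. A minimal sketch, assuming the commonly quoted 37% discount for a 1-year resource-based CUD:

```python
# Sketch: break-even point for a 1-year resource-based CUD vs on-demand.
# The 37% discount is the commonly quoted figure; treat it as an assumption.
def cud_breakeven_months(discount: float = 0.37, term_months: int = 12) -> float:
    """Months of on-demand usage at which the commitment becomes cheaper.
    You pay (1 - discount) * term_months worth of on-demand no matter what,
    so the CUD wins once you would have run longer than that anyway."""
    return round((1 - discount) * term_months, 2)

print(cud_breakeven_months())  # 7.56
```

In other words: if nexus-vm runs more than ~7.6 months of the year (it runs 12), the commitment is strictly cheaper, which is where the "pays back inside ~7 months" figure comes from.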
1.4 Disk types: PD, Hyperdisk, Local SSD
| Disk | Backing | Max IOPS/disk | Max throughput | Capacity | Use |
|---|---|---|---|---|---|
| pd-standard | Spinning HDD | ~7,500 r / 15,000 w | ~1.2 GB/s | 10 GB-64 TB | Cheap archive volumes, batch jobs |
| pd-balanced | SSD, mid tier | 15,000-80,000 | 240-1,200 MB/s | 10 GB-64 TB | Default for most VMs, good cost/perf |
| pd-ssd | SSD, premium | 15,000-100,000 | 240-1,200 MB/s | 10 GB-64 TB | Latency-sensitive DB workloads |
| pd-extreme | SSD, provisioned IOPS | up to 120,000 | 2,200 MB/s | 500 GB-64 TB | Predictable extreme IOPS, high cost |
| hyperdisk-balanced | SSD, decoupled IOPS+capacity+throughput | up to 350,000 | 5,000 MB/s | 4 GB-64 TB | Required on N4/C3/C4, future default |
| hyperdisk-extreme | SSD | up to 500,000 | 10,000 MB/s | 64 GB-64 TB | SAP HANA, high-end DBs |
| hyperdisk-throughput | HDD-priced, throughput-tuned | low | up to 600 MB/s | 2 TB-32 TB | Big sequential reads, log archives |
| Local SSD | NVMe attached to host | up to 9M (aggregate) | tens of GB/s | 375 GB increments | Scratch/cache, ephemeral, lost on stop |
🔥 Recency, PD-Standard sunset path
Google has been steering customers off pd-standard. New machine families (N4, C4) do not allow it. For nexus-vm on N2 the option still exists, but pd-balanced is the default for new boot disks and it is rare for the cost difference to justify pd-standard. Source: cloud.google.com/compute/docs/disks.
⚠️ Gotcha, performance scales with disk size
For pd-balanced and pd-ssd, IOPS and throughput scale with capacity. A 100 GB pd-balanced disk caps at ~3,000 IOPS no matter what your VM is. If your DB feels slow, oversize the disk. The size-IOPS curve is documented at cloud.google.com/compute/docs/disks/performance.
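The size-IOPS relationship can be modeled as a baseline plus a per-GB increment, capped per disk. The constants below are illustrative assumptions for a pd-balanced-style disk; the authoritative curve is the performance page linked above.

```python
# Sketch: linear size-to-IOPS model for a PD. Constants are ASSUMED for
# illustration; see cloud.google.com/compute/docs/disks/performance for real numbers.
def pd_read_iops(size_gb: int,
                 baseline: int = 3_000,     # assumed per-instance baseline
                 iops_per_gb: int = 6,      # assumed per-GB increment
                 cap: int = 80_000) -> int:
    """Provisioned read IOPS grow linearly with capacity up to a per-disk cap."""
    return min(baseline + size_gb * iops_per_gb, cap)

# Oversizing the disk is often the only lever: same VM, more IOPS.
for size in (100, 500, 2000):
    print(size, pd_read_iops(size))
```

The takeaway matches the gotcha: a small disk starves a busy database no matter how big the VM is, and growing the disk is the fix.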
⚠️ Gotcha, Local SSD is volatile
Local SSD is physically attached to the host machine. If the VM is stopped, terminated, live-migrated, or the host fails, the data is gone. Use Local SSD only for: ephemeral cache, RAID arrays where the application replicates data elsewhere, and scratch space for batch jobs. Never put a primary database on Local SSD without external replication.
1.5 OS images, Shielded VMs, Confidential VMs
Google maintains a catalog of public images: Debian (default for many tutorials), Ubuntu LTS (12, 14, 16, 18, 20, 22, 24), Rocky Linux 8/9, RHEL 7/8/9, CentOS Stream, SUSE, Windows Server (2016, 2019, 2022, 2025). Each project also gets a private image catalog for custom images you build with Packer or gcloud compute images create. The nexus-vm currently runs Debian (verify with cat /etc/os-release).
📝 Code, list latest Ubuntu 24.04 LTS images
gcloud compute images list \
--project=ubuntu-os-cloud \
--filter="family:ubuntu-2404-lts AND status=READY" \
--sort-by=~creationTimestamp \
--limit=3
Shielded VM adds three layers: secure boot (UEFI verifies signed firmware), virtual TPM (vTPM for measured boot, attestation), and integrity monitoring (each boot is checksummed and the dashboard shows drift). On by default for newer Google-published images. Costs nothing extra. Source: cloud.google.com/security/shielded-cloud/shielded-vm.
Confidential VM goes further: memory is encrypted in use using AMD SEV (N2D, C2D), AMD SEV-SNP (C3D), Intel TDX (C3), or NVIDIA H100 GPU memory protection (A3). Adds ~5-10% perf overhead on most workloads. Required for processing strongly regulated data on shared infrastructure. Source: cloud.google.com/confidential-computing.
⚠️ Gotcha, custom images and the kernel surprise
If you build a custom image from a Debian VM and apply it to a new VM, you may inherit a kernel pinned to the source machine type. When you create the new VM with a different machine family, the guest tools may fail to detect the new NIC or NVMe driver. The fix is to install google-osconfig-agent, google-cloud-sdk, and the google-compute-engine guest environment package before imaging. Or use gcloud compute images import which automates the conversion.
1.6 Metadata service: 169.254.169.254
Every GCE VM has a magic link-local IP 169.254.169.254 that serves project and instance metadata over HTTP. This is the same pattern as AWS EC2 IMDS but with key differences. Google's metadata service requires the header Metadata-Flavor: Google on every request, which prevents accidental exposure if a web server proxies untrusted user input.
📝 Code, metadata service queries from inside the VM
# Project ID
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/project/project-id
# Default service account email
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# Default SA OAuth access token (auto-refreshed by Google)
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
# SSH keys configured on the project (NOT used if OS Login is on)
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/project/attributes/ssh-keys
# Instance ID, zone, machine type
curl -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true&alt=json" | jq
⚠️ Gotcha, SSRF and the metadata service
If your application proxies arbitrary URLs and runs on a GCE VM, you have an SSRF vector to 169.254.169.254. The Metadata-Flavor: Google header was once required only on the v1 endpoints; it is now mandatory on every request, which blocks naive SSRF. Defense in depth still applies: lock down the egress URL allowlist or block link-local IPs at the application layer.
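An application-layer guard is a few lines of standard library. A minimal sketch of the "block link-local IPs" approach; the blocked-host names and the decision to also refuse loopback and private ranges are my assumptions, tighten or loosen them for your own egress policy:

```python
# Sketch: refuse user-supplied URLs that point at the metadata service
# or other link-local / internal targets, before the app fetches them.
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_HOSTS = {"metadata.google.internal", "metadata"}  # assumed denylist

def is_safe_url(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable: refuse rather than guess
    # Also refusing loopback and RFC 1918 here is a policy choice, not a requirement.
    return not (addr.is_link_local or addr.is_loopback or addr.is_private)

print(is_safe_url("http://169.254.169.254/computeMetadata/v1/"))  # False
```

Note this checks the name at validation time; a hardened version would pin the resolved IP for the actual fetch to defeat DNS rebinding.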
1.7 Startup scripts and shutdown scripts
Two metadata keys give you boot- and shutdown-time hooks: startup-script (or startup-script-url pointing to GCS) and shutdown-script. Startup scripts run as root every time the VM boots. Shutdown scripts run on a graceful stop with a 90-second timeout, after which the VM is force-stopped. Output goes to the serial console (gcloud compute instances get-serial-port-output nexus-vm) and to journalctl -u google-startup-scripts.service.
📝 Code, set a startup script that installs Ops Agent
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a \
--metadata=startup-script='#!/bin/bash
set -euo pipefail
if ! systemctl is-active --quiet google-cloud-ops-agent; then
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install
fi
systemctl enable google-cloud-ops-agent
systemctl start google-cloud-ops-agent'
⚠️ Gotcha, startup scripts run on every boot
Including when you stop and start the VM, including after a live migration in some cases. Make every startup script idempotent. The pattern systemctl is-active --quiet X || install_X is your friend. Do not put one-time bootstrap (project creation, DB init) in a startup script unless you guard it with a sentinel file.
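The sentinel-file guard generalizes beyond shell. A minimal sketch of the same pattern in Python, for one-time bootstrap steps driven from automation; the sentinel directory path and step names are illustrative:

```python
# Sketch: run a bootstrap step at most once, guarded by a sentinel file.
# SENTINEL_DIR is an assumed location, pick whatever survives reboots.
from pathlib import Path

SENTINEL_DIR = Path("/var/lib/nexus-bootstrap")

def run_once(step: str, action, sentinel_dir: Path = SENTINEL_DIR) -> bool:
    """Run `action` only if this step's sentinel does not exist yet.
    Returns True if the action ran, False if it was skipped."""
    sentinel_dir.mkdir(parents=True, exist_ok=True)
    marker = sentinel_dir / f"{step}.done"
    if marker.exists():
        return False
    action()            # if this raises, no marker is written, so it retries next boot
    marker.touch()      # mark done only after the action succeeded
    return True
```

Touching the marker after the action (not before) is the important detail: a failed bootstrap retries on the next boot instead of being silently marked complete.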
1.8 Live migration, spot, preemptible
By default, Google migrates your VM live to another host for maintenance with no downtime (typically <1 second pause). The default availability policy onHostMaintenance=MIGRATE is correct for production. The other option is TERMINATE, which is used for GPU/TPU machines and Spot VMs.
Spot VMs (the modern name) and Preemptible VMs (the legacy name, capped at 24 hours) are deeply discounted (60-91% off) compute that Google can reclaim with 30 seconds notice. Use for: stateless batch, fault-tolerant queues, CI runners. Do not use for: a single VM hosting your only production stack. Source: cloud.google.com/compute/docs/instances/spot.
🔥 Recency, preemptible VMs are deprecated for new use
The 24-hour-capped legacy preemptible VMs are still functional but Google steers everyone to Spot VMs (no time cap, more flexible reclaim contract). New automation should set --provisioning-model=SPOT and not --preemptible. Source: cloud.google.com/compute/docs/instances/preemptible.
1.9 SSH access methods (HIGH PRIORITY)
This is the section you will reread most. Three independent ways to SSH to a GCE VM, with very different security postures.
1.9.1 Method A, OS Login
Recommended default. SSH keys are tied to your Google identity (the email you log into the Cloud Console with), enforced via the roles/compute.osLogin or roles/compute.osAdminLogin IAM roles. Keys are pushed to the VM by the OS Login agent on boot. Revoking access is instant: remove the IAM binding and the key is removed on next sync.
📝 Code, enable OS Login at the project level
# Project-wide
gcloud compute project-info add-metadata \
--metadata enable-oslogin=TRUE
# Per-instance override
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a \
--metadata enable-oslogin=TRUE
# Grant SSH access (regular user)
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/compute.osLogin"
# Grant sudo access
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/compute.osAdminLogin"
1.9.2 Method B, Metadata SSH keys (legacy)
The traditional path. Each VM (or the project) carries an ssh-keys metadata entry containing public keys with a username:ssh-rsa AAAA... format. The Google Compute Engine guest agent picks these up and writes them into /home/<user>/.ssh/authorized_keys on each boot. This is what ~/.ssh/google_compute_engine is wired up to.
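When auditing that metadata entry, it helps to parse it instead of eyeballing it. A minimal sketch of a parser for the username:key format described above; the sample keys are made up:

```python
# Sketch: parse an ssh-keys metadata blob (one "username:key-type key [comment]"
# entry per line) into a username -> keys mapping. Sample data is fabricated.
def parse_ssh_keys(metadata_value: str) -> dict:
    keys: dict = {}
    for line in metadata_value.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue                      # skip blanks and malformed entries
        user, pubkey = line.split(":", 1)
        keys.setdefault(user, []).append(pubkey)
    return keys

sample = "dovew:ssh-rsa AAAATESTKEY robert\nci:ssh-ed25519 AAAACITESTKEY runner"
print(parse_ssh_keys(sample))
```

Feed it the output of the metadata curl from Section 1.6 and you get an instant answer to "whose keys are actually on this project," which is the monthly audit the checklist below asks for.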
📝 Code, the BSP standard SSH path
# From Robert's local machine
ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122
# Or via gcloud (auto-handles keys, OS Login if enabled, IAP otherwise)
gcloud compute ssh nexus-vm --zone=us-central1-a
# With command, no shell
gcloud compute ssh nexus-vm --zone=us-central1-a --command="uptime"
⚠️ Gotcha, OS Login and metadata keys conflict
If enable-oslogin=TRUE is set, the metadata ssh-keys entry is ignored. You can be locked out if you flip OS Login on without granting yourself the OS Login IAM role. Always grant the role first, verify SSH works, then enable OS Login.
1.9.3 Method C, IAP TCP forwarding
Identity-Aware Proxy TCP forwarding tunnels SSH (and other TCP) through Google's IAP fabric to a VM that has no public IP. The connection authenticates as your Google identity, and the VM's firewall only needs to allow port 22 from the IAP range 35.235.240.0/20. This is the path to a fully private VM that no one on the internet can reach. Source: cloud.google.com/iap/docs/using-tcp-forwarding.
📝 Code, SSH via IAP (no public IP needed)
# gcloud handles the tunnel automatically
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
# Required IAM
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:robert.dove@callbrightside.com" \
--role="roles/iap.tunnelResourceAccessor"
# Required firewall rule
gcloud compute firewall-rules create allow-ssh-from-iap \
--direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20
💡 Insight, the right answer for nexus-vm
Today, nexus-vm is reachable on a public IP 34.55.179.122 with metadata SSH keys. The bulletproof path is OS Login + IAP TCP forwarding + no public IP. The morpheus.callbrightside.com command center can still serve traffic via a load balancer with the VM as the backend. Section 15 walks through the migration.
1.10 Instance groups, autoscalers, and managed templates
For multi-VM patterns: a managed instance group (MIG) spawns identical VMs from an instance template, can autoscale on CPU, custom metric, or schedule, and integrates with backend services for load balancing. A regional MIG spreads VMs across all zones in a region for HA. We don't use MIGs today on nexus-vm but if BSP grows out of single-VM, the migration path is: snapshot the disk, build an instance template from the snapshot, define a MIG with size 1 first, then scale.
📝 Code, build an instance template from current nexus-vm
# Step 1: snapshot the boot disk
gcloud compute disks snapshot nexus-vm \
--zone=us-central1-a \
--snapshot-names=nexus-vm-template-$(date +%Y%m%d)
# Step 2: build a custom image
gcloud compute images create nexus-vm-image-v1 \
--source-snapshot=nexus-vm-template-$(date +%Y%m%d) \
--family=nexus-vm
# Step 3: create the template
gcloud compute instance-templates create nexus-vm-tpl-v1 \
--machine-type=n2-standard-2 \
--image-family=nexus-vm \
--image-project=PROJECT_ID \
--tags=http-server,https-server
1.11 gcloud compute reference (single-VM ops)
| Operation | Command |
|---|---|
| Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a |
| Stop / start | gcloud compute instances stop nexus-vm --zone=us-central1-a |
| Resize machine type | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 --zone=us-central1-a (VM must be stopped) |
| Resize boot disk | gcloud compute disks resize nexus-vm --size=100GB --zone=us-central1-a then resize2fs in the guest |
| Add a new disk | gcloud compute disks create data-1 --size=200GB --type=pd-balanced --zone=us-central1-a then gcloud compute instances attach-disk nexus-vm --disk=data-1 --zone=us-central1-a |
| Snapshot | gcloud compute disks snapshot nexus-vm --zone=us-central1-a --snapshot-names=nexus-vm-$(date +%Y%m%d) |
| Reset (hard) | gcloud compute instances reset nexus-vm --zone=us-central1-a (last resort) |
| Serial console | gcloud compute instances get-serial-port-output nexus-vm --zone=us-central1-a |
| Release a static IP | gcloud compute addresses delete IP_NAME --region=us-central1 (releases the address for good; to detach it from a VM without releasing it, remove the instance's access config instead) |
1.12 SVG: nexus-vm topology (diagram not reproduced in this text version)
✅ Production checklist, Compute Engine
- nexus-vm machine type sized to 95th percentile load + 25% headroom
- Boot disk type pd-balanced minimum, sized for required IOPS curve
- Daily snapshot schedule on the boot disk (Section 15.7)
- Static external IP reserved and named (so it survives stop/start)
- OS Login enabled OR ssh-keys metadata audited and pruned monthly
- Shielded VM features all on (secure boot, vTPM, integrity monitoring)
- Ops Agent installed and reporting (Section 6)
- Startup script idempotent and version-controlled
- onHostMaintenance=MIGRATE for production VMs
- No production workload on Local SSD without external replication
- Metadata access restricted, no SSRF surface in user-facing apps
🎓 FOR NEW HIRE, Compute Engine cheat lines
- "VM" = "instance" = "GCE VM" all mean the same thing.
- gcloud compute instances list shows you everything we run.
- To get to nexus-vm: gcloud compute ssh nexus-vm --zone=us-central1-a.
- Never run a destructive command (stop, delete, snapshot delete) without typing the VM name explicitly. The console autocompletes, that is dangerous.
- Python is the language we automate GCP from. The library is google-cloud-compute. Install with pip install google-cloud-compute and read the SDK docs.
🔒 2. IAM, Service Accounts, Audit HIGH PRIORITY
IAM (Identity and Access Management) is how Google decides whether an identity (a user, a service account, or a Google group) is allowed to perform an action on a resource. Mastering IAM is the difference between secure, predictable infrastructure and a 3 a.m. incident root cause that reads "the default service account had Owner."
2.1 The hierarchy: organization, folder, project, resource
Resources sit in a four-level hierarchy. IAM bindings attached at any level inherit downward.
- Organization, the root, tied to a Cloud Identity or Workspace domain (callbrightside.com).
- Folder, optional grouping (dev, prod, sandbox, by team).
- Project, the billing and quota boundary, where most resources actually live.
- Resource, e.g. a GCS bucket, a GCE instance, a Cloud SQL DB, sometimes accepts its own bindings.
A binding is a 3-tuple (member, role, condition?). Members are the identity, roles are bundles of permissions, conditions are optional CEL expressions that gate the binding by request attributes. Source: cloud.google.com/iam/docs/overview.
⚠️ Gotcha, inheritance is additive only
IAM bindings add permissions as you go down the tree, never subtract. If you grant Owner at the Org level, you cannot revoke it at the project level. The only way to remove an inherited permission is to remove the higher binding or use a Deny policy (Org Policy + IAM Deny, see 2.10).
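The additive-only rule is easy to internalize as a set union. A toy model only, the role names are real but the evaluation logic is deliberately simplified (no conditions, no deny policies):

```python
# Sketch: IAM Allow inheritance as a union down the resource hierarchy.
# Toy model: real evaluation also considers conditions and Deny policies.
HIERARCHY = ["org", "folder", "project", "resource"]

def effective_roles(bindings: dict, level: str) -> set:
    """Roles in effect at `level` = union of grants at that level and above.
    Nothing at a lower level can subtract from this set."""
    idx = HIERARCHY.index(level)
    granted: set = set()
    for lvl in HIERARCHY[: idx + 1]:
        granted |= bindings.get(lvl, set())
    return granted

bindings = {"org": {"roles/viewer"}, "project": {"roles/compute.osLogin"}}
print(effective_roles(bindings, "resource"))  # both roles: nothing subtracts
```

This is why an Org-level Owner grant is so dangerous: every project and resource below it inherits the full set, and no project-level change can take it back.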
2.2 Service accounts, the identity for automation
A service account (SA) is a Google identity owned by a project, used by software (not humans) to authenticate. Email format: NAME@PROJECT_ID.iam.gserviceaccount.com. Every project gets several service accounts created automatically.
| Service account | Email pattern | Purpose |
|---|---|---|
| Compute Engine default SA | PROJECT_NUMBER-compute@developer.gserviceaccount.com | Identity assumed by GCE VMs unless overridden |
| App Engine default SA | PROJECT_ID@appspot.gserviceaccount.com | App Engine and Cloud Functions Gen 1 default |
| Google APIs SA | PROJECT_NUMBER@cloudservices.gserviceaccount.com | Used by GCP services to act on your behalf, e.g. Deployment Manager |
| Cloud Build SA | PROJECT_NUMBER@cloudbuild.gserviceaccount.com | Cloud Build runs builds as this identity |
| Cloud Run SA | Compute default unless overridden | Service identity for Cloud Run revisions |
| Pub/Sub SA | service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com | Pub/Sub uses this for push delivery |
🔥 Recency, default SA permissions tightened
Before May 2024, the Compute Engine default SA was granted Editor (roles/editor) on the project at creation. Organizations created since then enforce the org policy iam.automaticIamGrantsForDefaultServiceAccounts by default, so new projects no longer auto-grant Editor; older orgs should enforce it themselves. Verify with gcloud projects get-iam-policy PROJECT_ID. Source: cloud.google.com/iam/docs/service-account-overview.
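The verification step can be scripted against the JSON form of the policy (gcloud projects get-iam-policy PROJECT_ID --format=json). A minimal sketch; the sample policy below is fabricated:

```python
# Sketch: flag the Compute default SA if it still holds roles/editor.
# Input is the policy dict from `gcloud projects get-iam-policy --format=json`.
def default_sa_has_editor(policy: dict, project_number: str) -> bool:
    sa = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    for binding in policy.get("bindings", []):
        if binding.get("role") == "roles/editor" and sa in binding.get("members", []):
            return True
    return False

policy = {"bindings": [{"role": "roles/editor",
                        "members": ["serviceAccount:123456-compute@developer.gserviceaccount.com"]}]}
print(default_sa_has_editor(policy, "123456"))  # True: this project needs attention
```

Drop this into the quarterly review automation and the "default SA has Editor" finding stops depending on someone remembering to look.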
2.3 Service account keys, the danger zone
You can mint a downloadable JSON key for a service account. This key is a long-lived bearer credential. If it leaks, the holder authenticates as the SA from anywhere on the internet until you rotate the key. Rules of the road:
- Avoid SA keys for any workload that can use Application Default Credentials, Workload Identity Federation, or attached SA on a GCE VM/Cloud Run service.
- If you must mint one, set an expiry, store it in Secret Manager, and rotate every 90 days.
- Org policy iam.disableServiceAccountKeyCreation blocks key creation entirely.
- Org policy iam.disableServiceAccountKeyUpload blocks BYO key uploads.
📝 Code, list and rotate SA keys
# List keys for an SA
gcloud iam service-accounts keys list \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Create a new key (note: this command sets no expiry; enforce one via the
# org policy constraints/iam.serviceAccountKeyExpiryHours)
gcloud iam service-accounts keys create new-key.json \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Disable an old key (preferred over delete during rotation)
gcloud iam service-accounts keys disable KEY_ID \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
# Delete after the rollout is verified
gcloud iam service-accounts keys delete KEY_ID \
--iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com
2.4 Workload Identity Federation (the right answer)
WIF lets a workload outside GCP (GitHub Actions, AWS, Okta, anything that issues an OIDC or SAML token) impersonate a Google service account without ever touching a downloaded key. You configure a workload identity pool, define provider trust (issuer URL, audience, attribute mapping), and grant the external identity roles/iam.workloadIdentityUser on the target SA.
📝 Code, GitHub Actions to GCP without a key
# 1. Create the workload identity pool
gcloud iam workload-identity-pools create gh-pool \
--location=global \
--display-name="GitHub Actions"
# 2. Add the GitHub OIDC provider
gcloud iam workload-identity-pools providers create-oidc gh-provider \
--workload-identity-pool=gh-pool --location=global \
--issuer-uri="https://token.actions.githubusercontent.com" \
--attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
--attribute-condition="assertion.repository=='callbrightside/nexus'"
# 3. Grant the SA's WorkloadIdentityUser role to the GitHub repo subject
gcloud iam service-accounts add-iam-policy-binding \
ci-deployer@PROJECT_ID.iam.gserviceaccount.com \
--role=roles/iam.workloadIdentityUser \
--member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/attribute.repository/callbrightside/nexus"
2.5 Role taxonomy: basic, predefined, custom
| Tier | Examples | Use |
|---|---|---|
| Basic (legacy) | roles/owner, roles/editor, roles/viewer | Avoid in production. Far too broad. |
| Predefined | roles/compute.instanceAdmin, roles/storage.objectViewer, roles/secretmanager.secretAccessor | Standard answer. Use these. |
| Custom | You define a list of permissions | For least-privilege when no predefined role fits |
📝 Code, build a custom role for the nexus runner
# nexus-runner.yaml
title: "Nexus Runner"
description: "Read GCS, write logs, no admin"
stage: GA
includedPermissions:
- storage.objects.get
- storage.objects.list
- logging.logEntries.create
- secretmanager.versions.access
gcloud iam roles create nexusRunner \
--project=PROJECT_ID \
--file=nexus-runner.yaml
2.6 Permissions you actually need on nexus-vm
| Action | Required permission(s) | Predefined role |
|---|---|---|
| SSH to nexus-vm via gcloud | compute.instances.get + iap.tunnelInstances.accessViaIAP | roles/compute.osLogin + roles/iap.tunnelResourceAccessor |
| Stop/start nexus-vm | compute.instances.stop, compute.instances.start | roles/compute.instanceAdmin.v1 |
| Read a GCS bucket | storage.objects.get, storage.objects.list | roles/storage.objectViewer |
| Read a Secret Manager value | secretmanager.versions.access | roles/secretmanager.secretAccessor |
| Write log entries | logging.logEntries.create | roles/logging.logWriter |
| Write metrics | monitoring.timeSeries.create | roles/monitoring.metricWriter |
| Snapshot a disk | compute.disks.createSnapshot | roles/compute.storageAdmin |
2.7 Conditional bindings (CEL)
You can attach a Common Expression Language condition to any binding. The binding only fires when the request matches.
📝 Code, restrict GCS read access to a specific bucket and time window
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:auditor@callbrightside.com" \
--role="roles/storage.objectViewer" \
--condition='expression=resource.name.startsWith("projects/_/buckets/bsp-audit") && request.time < timestamp("2026-12-31T23:59:59Z"),title=audit-window-2026'
2.8 Domain-wide delegation
For Workspace customers, a service account can be granted the right to impersonate any user in the domain for specific OAuth scopes. Used by automation that needs to send mail as a user, read calendars across the org, etc. Configured in admin.google.com under Security → API Controls → Domain-wide Delegation. The SA's "client ID" (numeric, not the email) is whitelisted with a list of scopes. Powerful, dangerous, audit it.
2.9 Audit logs taxonomy
Cloud Audit Logs come in four streams:
- Admin Activity, always on, free, 400-day retention. Captures any API call that modifies config or metadata. Source of truth for "who turned off the firewall."
- Data Access, off by default for most services (BigQuery is the exception), can be expensive, captures read/write of data. Enable selectively per service.
- System Event, generated by Google, captures auto-actions like live migrations.
- Policy Denied, captures denied requests so you can debug missing IAM.
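The four streams are distinguishable by the logName suffix on each entry (the log ID keeps the URL-encoded %2F). A small convenience sketch; the suffix-to-stream mapping follows the stream names above:

```python
# Sketch: classify an audit log entry by its logName suffix.
# The log IDs are the documented stream names; the mapping is a convenience.
STREAMS = {
    "%2Factivity": "Admin Activity",
    "%2Fdata_access": "Data Access",
    "%2Fsystem_event": "System Event",
    "%2Fpolicy": "Policy Denied",
}

def audit_stream(log_name: str) -> str:
    for suffix, stream in STREAMS.items():
        if log_name.endswith(suffix):
            return stream
    return "unknown"

print(audit_stream("projects/p/logs/cloudaudit.googleapis.com%2Factivity"))  # Admin Activity
```

Useful when post-processing gcloud logging read output: group entries by stream first, then drill into the always-on Admin Activity stream for "who changed what."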
📝 Code, query audit logs for IAM changes
gcloud logging read \
'logName=~"cloudaudit.googleapis.com%2Factivity" AND protoPayload.serviceName="iam.googleapis.com"' \
--limit=20 --format="table(timestamp,protoPayload.authenticationInfo.principalEmail,protoPayload.methodName,protoPayload.resourceName)"
2.10 IAM Deny policies
2023+ feature. A Deny policy is the only way to prevent a permission regardless of inherited Allow bindings. Attaches at Org, Folder, or Project level. Useful for guardrails like "no one, not even Org Admins, can disable Audit Logging."
📝 Code, deny a specific permission for everyone except a break-glass group
# deny-policy.json
{
"displayName": "deny-disable-audit-logging",
"rules": [{
"deniedPrincipals": ["principalSet://goog/public:all"],
"exceptionPrincipals": ["principalSet://goog/group/break-glass@callbrightside.com"],
"deniedPermissions": ["logging.googleapis.com/sinks.delete"]
}]
}
gcloud iam policies create deny-disable-audit \
--kind=denypolicies \
--policy-file=deny-policy.json \
--attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID
2.11 Troubleshooter, Policy Analyzer, Policy Simulator
- IAM Troubleshooter, "why can user X not do action Y on resource Z" wizard. Console → IAM → Troubleshoot.
- Policy Analyzer, run queries like "list all bindings that grant any permission on resource Z." Useful for compliance audits.
- Policy Simulator, simulate a policy change before applying it. Tells you which historical requests would have been allowed/denied differently.
✅ Production checklist, IAM
- No human user has roles/owner or roles/editor on the production project
- Default Compute SA does not have Editor (verify org policy)
- SA keys forbidden by org policy or rotated <90 days
- External CI uses Workload Identity Federation, not downloaded keys
- All bindings reviewed quarterly with Policy Analyzer
- Data Access logs on for high-value services (Secret Manager, GCS audit bucket)
- Deny policy guarding Audit Logs and IAM-modifying permissions
- Custom roles preferred over basic roles for least privilege
🎓 FOR NEW HIRE, IAM mental model
Every API call to GCP gets stamped "who is asking, what are they trying to do, on what resource." IAM is the lookup table that returns yes or no. The fastest debugging path when you get a 403: copy the exact permission string from the error, search the role catalog (cloud.google.com/iam/docs/understanding-roles), and grant the smallest predefined role that contains it. Never grant Owner to fix something. Once you do, you cannot tell what permission was actually missing.
🌐 3. Networking, VPC, Load Balancing HIGH PRIORITY
Every byte that reaches nexus-vm traverses a chain of network primitives. Understanding the chain is what turns "the site is down" from a 30-minute incident into a 2-minute fix.
3.1 VPC architecture
A VPC (Virtual Private Cloud) is a global, software-defined network. Unlike AWS, where each VPC is regional, Google's VPC spans every region. A subnet, however, is regional. So a single VPC named default typically has one auto-mode subnet per region, each with a non-overlapping CIDR.
VPCs come in two modes:
- Auto mode, Google manages a /20 subnet per region in the 10.128.0.0/9 range. Convenient for sandboxes.
- Custom mode, you create subnets explicitly with chosen CIDRs. Required for any production workload.
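The range math above is easy to check offline before you touch gcloud. A minimal sketch with Python's standard ipaddress module (the candidate CIDRs are illustrative, not our live plan): confirm an auto-mode /20 nests inside 10.128.0.0/9, and reject a custom subnet that would collide with an existing range.

📝 Code, sanity-check subnet CIDRs before creating them

```python
import ipaddress

# Auto-mode subnets are /20s carved out of the 10.128.0.0/9 super-range.
auto_super = ipaddress.ip_network("10.128.0.0/9")
us_central1_default = ipaddress.ip_network("10.128.0.0/20")
assert us_central1_default.subnet_of(auto_super)

def overlaps(existing: list[str], candidate: str) -> bool:
    """True if the candidate CIDR collides with any existing subnet range."""
    cand = ipaddress.ip_network(candidate)
    return any(cand.overlaps(ipaddress.ip_network(e)) for e in existing)

# Planning a second subnet next to nexus-subnet's 10.10.0.0/24:
print(overlaps(["10.10.0.0/24"], "10.10.0.128/25"))  # True, collides
print(overlaps(["10.10.0.0/24"], "10.10.1.0/24"))    # False, safe to create
```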
📝 Code, create a custom-mode VPC and a subnet
gcloud compute networks create bsp-prod-vpc --subnet-mode=custom
gcloud compute networks subnets create nexus-subnet \
--network=bsp-prod-vpc \
--region=us-central1 \
--range=10.10.0.0/24 \
--enable-private-ip-google-access \
--enable-flow-logs
3.2 Subnets, secondary ranges, alias IPs
A subnet has a primary IPv4 range used for VM NICs, plus optional secondary ranges used for GKE pod and service IPs (alias IP), or for assigning a /28 to a Cloud SQL Private Service Connection. Source: cloud.google.com/vpc/docs/subnets.
3.3 Firewall rules
VPC firewalls are stateful: return traffic for an allowed connection is automatically permitted, so a connection initiated from inside the VPC gets its responses back without a separate ingress allow rule. Rules are scoped to a network and evaluated in priority order (lowest number wins; on a priority tie, deny beats allow).
| Field | Meaning |
|---|---|
| Direction | INGRESS or EGRESS |
| Action | ALLOW or DENY |
| Priority | 0-65535, lower = higher priority. Default 1000. |
| Source ranges | List of CIDRs (ingress only) |
| Source tags / SAs | Restrict to VMs with a network tag or running as an SA |
| Target tags / SAs | Apply only to VMs with a tag or SA |
| Protocols/ports | e.g. tcp:80,443, udp:53, all |
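To internalize the evaluation order, here is a toy evaluator. This is a sketch for intuition only; real VPC firewall matching also considers direction, source ranges, and target tags/SAs, which this deliberately skips.

📝 Code, toy model of firewall rule evaluation

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    priority: int   # 0-65535, lower number wins
    action: str     # "ALLOW" or "DENY"
    ports: set[int]

def evaluate(rules: list[Rule], port: int) -> str:
    """Lowest priority number wins; on a tie, DENY beats ALLOW.
    No match falls through to the implied ingress deny."""
    matching = [r for r in rules if port in r.ports]
    if not matching:
        return "DENY (implied)"
    best = min(r.priority for r in matching)
    tied = [r for r in matching if r.priority == best]
    return "DENY" if any(r.action == "DENY" for r in tied) else "ALLOW"

rules = [
    Rule("allow-https-cf", 1000, "ALLOW", {443}),
    Rule("deny-all-ingress", 65534, "DENY", set(range(1, 65536))),
]
print(evaluate(rules, 443))   # ALLOW, priority 1000 beats 65534
print(evaluate(rules, 8080))  # DENY, only the catch-all matches
```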
📝 Code, the BSP standard nexus-vm firewall posture
# SSH only from IAP
gcloud compute firewall-rules create allow-ssh-iap \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20 \
--target-tags=ssh-iap
# HTTPS from Cloudflare only
gcloud compute firewall-rules create allow-https-cf \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:443 --source-ranges=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22 \
--target-tags=web
# Default deny all other ingress (priority 65534, hardened)
gcloud compute firewall-rules create deny-all-ingress \
--network=bsp-prod-vpc --direction=INGRESS --action=DENY \
--rules=all --source-ranges=0.0.0.0/0 --priority=65534
⚠️ Gotcha, default network is too permissive
Auto-created default network ships with default-allow-ssh, default-allow-rdp, and default-allow-icmp open to 0.0.0.0/0, plus default-allow-internal open across the entire auto-mode range. Production should use a custom VPC with no default rules. If you must keep default, delete default-allow-ssh and default-allow-rdp immediately.
⚠️ Gotcha, network tags vs target service accounts
Network tags are unauthenticated metadata, anyone with compute.instances.setTags can add the web tag to any VM and inherit its firewall rules. Target service accounts require iam.serviceAccounts.actAs and are the secure default for production. Migrate from tags to SAs in the firewall rules.
3.4 Cloud NAT, Private Google Access, Private Service Connect
- Cloud NAT, gives outbound internet to VMs without external IPs. Regional, scaled by NAT IPs you provision (start with auto, watch for port exhaustion at >~64,000 outbound conns per IP).
- Private Google Access, lets a VM with no external IP reach googleapis.com endpoints (storage, logging, monitoring, secrets) over Google's backbone. Enable at the subnet level. Free.
- Private Service Connect (PSC), exposes a managed service (Cloud SQL, third-party SaaS) at a private IP inside your VPC. Hides the public endpoint entirely.
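The Cloud NAT port-exhaustion math is worth having at your fingertips. A back-of-envelope sketch, assuming the documented 64,512 usable ports per NAT IP (the first 1,024 are reserved) and Cloud NAT's default minimum of 64 ports per VM; verify both numbers against the current NAT docs.

📝 Code, Cloud NAT capacity back-of-envelope

```python
USABLE_PORTS_PER_NAT_IP = 64512   # 65536 minus the 1024 reserved ports
DEFAULT_MIN_PORTS_PER_VM = 64     # Cloud NAT default minPortsPerVm

def vms_supported(nat_ips: int, ports_per_vm: int = DEFAULT_MIN_PORTS_PER_VM) -> int:
    """How many VMs a NAT config can serve before port allocation fails."""
    return (nat_ips * USABLE_PORTS_PER_NAT_IP) // ports_per_vm

print(vms_supported(1))        # 1008 VMs on one NAT IP at the default
print(vms_supported(2, 1024))  # 126 VMs if each needs 1024 concurrent conns
```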
3.5 Cloud Load Balancing variants
| Type | Layer | Scope | Use |
|---|---|---|---|
| Global external Application LB | L7 HTTPS | Global anycast IP | Public web apps, multi-region failover |
| Regional external Application LB | L7 HTTPS | Regional | Single-region web with custom auth |
| Internal Application LB | L7 HTTPS | Regional, VPC-internal | Microservices inside the VPC |
| Global external Network LB | L4 TCP/UDP/SSL | Global anycast IP | Non-HTTP global, e.g. game servers |
| Regional external Network LB | L4 TCP/UDP | Regional | Pass-through, preserves source IP |
| Internal Network LB | L4 TCP/UDP | Regional, VPC-internal | Internal pass-through |
3.6 Cloud Armor
WAF for the global Application Load Balancer. Features: pre-configured OWASP rules, rate-limiting (per-IP per-minute thresholds), bot management, geo-based allow/deny (GeoIP), Adaptive Protection (ML-based DDoS), reCAPTCHA Enterprise integration. Enabled per backend service. The Cloudflare in front of nexus-vm handles much of this today, but if we move to a GCP load balancer, Cloud Armor takes over.
3.7 Cloud CDN
Edge caching tied to the global Application LB. Set --enable-cdn on a backend service. Cache keys default to host + path, customizable. Negative caching for 404s. Cache invalidation via gcloud compute url-maps invalidate-cdn-cache. Today Cloudflare is our CDN, Cloud CDN is the migration target if we leave Cloudflare.
3.8 Cloud DNS
Authoritative managed DNS. Two zone types: public (resolvable on the internet) and private (resolvable only inside designated VPCs). DNSSEC available, DNS forwarding for hybrid (on-prem to GCP). Today callbrightside.com DNS is on Cloudflare; Cloud DNS is the option if we centralize on Google.
3.9 VPC peering, Shared VPC, VPC-SC
- VPC Peering, point-to-point, non-transitive connection between two VPCs. CIDRs must not overlap. No bandwidth limit.
- Shared VPC, one host project owns the network, multiple service projects attach VMs. Centralizes networking ops.
- VPC Service Controls (VPC-SC), defines a security perimeter around services like GCS, BigQuery, Secret Manager. Even if an SA key leaks, it cannot exfiltrate data outside the perimeter.
3.10 Interconnect, VPN, Network Connectivity Center
Hybrid cloud options:
- Cloud VPN, IPsec tunnels, classic and HA flavors, ~3 Gbps per tunnel.
- Dedicated Interconnect, 10 Gbps or 100 Gbps physical circuit, requires a colocation provider.
- Partner Interconnect, 50 Mbps to 50 Gbps via a service provider.
- Cross-Cloud Interconnect, dedicated link to AWS, Azure, OCI, Alibaba.
- Network Connectivity Center (NCC), hub-and-spoke management for complex multi-VPC and hybrid topologies.
3.11 IAP for HTTPS
Beyond TCP forwarding (Section 1.9), IAP can sit in front of an HTTPS Load Balancer to add Google identity authentication on top of any backend (GCE, GKE, Cloud Run, App Engine). Set --enable-iap on the backend, the LB returns a Google sign-in flow before the request reaches the backend. The signed JWT is forwarded as X-Goog-IAP-JWT-Assertion for the backend to verify.
3.12 Static and ephemeral IPs, IP forwarding
- Ephemeral external IP, attached at instance create time, lost on instance delete (or stop, if not promoted).
- Static external IP, regional resource you reserve and attach. Survives instance lifecycle. Verify nexus-vm uses one.
- IP forwarding, allow a VM to act as a NAT/router by setting canIpForward=true. Required for software-defined gateways.
📝 Code, verify nexus-vm has a static IP
gcloud compute addresses list --filter="address=34.55.179.122"
# If empty, the IP is ephemeral. Promote it:
gcloud compute addresses create nexus-vm-static \
--addresses=34.55.179.122 \
--region=us-central1
3.13 Network telemetry: VPC Flow Logs, Mirror, Intelligence Center
- VPC Flow Logs, sampled connection records exported to Cloud Logging. Enable per-subnet. Cost scales with sample rate.
- Packet Mirror, full pcap of selected traffic for forensics or IDS appliances.
- Network Intelligence Center, dashboards for topology, connectivity tests, performance, firewall insights, and routes.
✅ Production checklist, Networking
- Custom-mode VPC, no default network in production
- Default-deny ingress firewall at priority 65534
- Service-account-based firewall targets (not network tags) for production
- Cloud NAT for outbound from any VM without an external IP
- Private Google Access on every subnet hosting workload VMs
- VPC Flow Logs on with reasonable sampling (5-10%)
- Static external IP reserved for any user-facing endpoint
- Cloud Armor (or Cloudflare equivalent) WAF in front of all public endpoints
- DNS records have TTL <=300s for any IP that might change
- Connectivity Tests run after every firewall change
🎓 FOR NEW HIRE, Networking field guide
"VPC" = software-defined network. "Subnet" = the per-region range of IPs. "Firewall rule" = which ports/sources can reach which VMs. "Load balancer" = the front door if we have more than one VM. Use gcloud compute networks list, gcloud compute networks subnets list, gcloud compute firewall-rules list to see the current state. When something is unreachable, the order of debugging is: (1) DNS, (2) Cloudflare, (3) firewall rule, (4) the VM's own iptables, (5) the application listening on the port. The Network Intelligence Center connectivity test does the first three for you.
💾 4. Storage, GCS, Cloud SQL HIGH PRIORITY
Storage on GCP comes in three flavors that matter for nexus-vm: object storage (GCS), block storage (PD/Hyperdisk attached to the VM), and managed databases (Cloud SQL). This section covers all three plus Filestore for shared files.
4.1 GCS storage classes
| Class | Min duration | Storage $/GB-mo | Retrieval $/GB | Use |
|---|---|---|---|---|
| Standard | None | ~$0.020 | $0 | Hot, frequent access |
| Nearline | 30 days | ~$0.010 | $0.01 | Monthly backups |
| Coldline | 90 days | ~$0.004 | $0.02 | Quarterly backups |
| Archive | 365 days | ~$0.0012 | $0.05 | Compliance, long-term |
Min duration means you pay as if the object lived that long even if you delete earlier. All classes have the same low first-byte latency, the difference is purely cost-vs-retention. Multi-regional and dual-regional buckets cost slightly more for higher availability. Source: cloud.google.com/storage/pricing.
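The min-duration billing rule can be made concrete. A hedged sketch using the approximate per-GB prices from the table above; check cloud.google.com/storage/pricing for current numbers before budgeting.

📝 Code, early-deletion cost per storage class

```python
# $/GB-month figures from the table above, approximate list prices.
CLASSES = {
    "STANDARD": {"price": 0.020, "min_days": 0},
    "NEARLINE": {"price": 0.010, "min_days": 30},
    "COLDLINE": {"price": 0.004, "min_days": 90},
    "ARCHIVE":  {"price": 0.0012, "min_days": 365},
}

def storage_cost(cls: str, gb: float, days_stored: int) -> float:
    """Billable cost: you pay for at least min_days even if you delete sooner."""
    c = CLASSES[cls]
    billable_days = max(days_stored, c["min_days"])
    return round(gb * c["price"] * billable_days / 30, 4)

# Deleting a Coldline object after 10 days still bills the full 90 days.
print(storage_cost("COLDLINE", 100, 10))  # identical to storing 90 days
print(storage_cost("COLDLINE", 100, 90))
```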
4.2 Lifecycle, versioning, retention, bucket lock
Lifecycle rules transition or delete objects based on age, version count, or class.
📝 Code, lifecycle for nexus-vm snapshots in GCS
# lifecycle.json
{
"lifecycle": {
"rule": [
{"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
"condition": {"age": 30}},
{"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
"condition": {"age": 90}},
{"action": {"type": "Delete"},
"condition": {"age": 365}}
]
}
}
gcloud storage buckets update gs://bsp-backups \
--lifecycle-file=lifecycle.json
Object versioning keeps prior versions when you overwrite or delete. Retention policy enforces a minimum age before deletion. Bucket Lock makes a retention policy permanent (cannot be removed, only extended). Use Bucket Lock for compliance buckets.
4.3 IAM vs ACLs on buckets
Two access control systems coexist:
- Uniform bucket-level access (UBLA), recommended, IAM only.
- Fine-grained, legacy, IAM + per-object ACLs. Hard to audit.
Set UBLA at bucket creation: gcloud storage buckets create gs://bsp-backups --uniform-bucket-level-access. The Object ACL system is essentially deprecated for new buckets. Source: cloud.google.com/storage/docs/uniform-bucket-level-access.
4.4 Signed URLs, signed policies, CORS
A signed URL grants time-bounded access to a single object using a service account's private key. CORS lets browser-based JS upload/download from a bucket.
📝 Code, signed URL in Python
from google.cloud import storage
from datetime import timedelta
client = storage.Client()
bucket = client.bucket("bsp-uploads")
blob = bucket.blob("reports/q2-2026.pdf")
url = blob.generate_signed_url(
version="v4",
expiration=timedelta(minutes=15),
method="GET",
)
print(url)
⚠️ Gotcha, signed URLs require a key or signBlob permission
To sign with a service account on a GCE VM, the VM's SA needs iam.serviceAccounts.signBlob on itself. Otherwise you get an opaque error about no private key. Grant roles/iam.serviceAccountTokenCreator on the SA to the SA itself.
4.5 Cloud SQL configurations
| Engine | Versions | Notes |
|---|---|---|
| MySQL | 5.7, 8.0, 8.4 | 5.7 EOL approaching |
| PostgreSQL | 12, 13, 14, 15, 16, 17 | 11 and earlier past EOL |
| SQL Server | 2017, 2019, 2022 (Std, Enterprise, Web, Express) | License included or BYOL |
Tiers: shared-core (db-f1-micro, db-g1-small, retired in 2024+ for some versions), custom (1-96 vCPU, 0.9-624 GB RAM), high-memory presets. Storage: SSD or HDD, autogrow available.
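Custom tiers are expressed as machine type strings of the form db-custom-VCPU-MEMORY_MB, e.g. db-custom-4-15360 for 4 vCPU and 15 GB. A small helper sketch; the per-vCPU memory limits encoded below are from the docs as remembered (0.9 to 6.5 GB per vCPU, multiple of 256 MB), verify before relying on them.

📝 Code, build a Cloud SQL custom tier string

```python
def custom_tier(vcpus: int, memory_mb: int) -> str:
    """Build a Cloud SQL custom machine type like db-custom-4-15360.
    Constraint checks reflect documented limits (verify against current
    docs): memory a multiple of 256 MB, and 0.9-6.5 GB per vCPU."""
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    per_vcpu = memory_mb / vcpus
    if not (921.6 <= per_vcpu <= 6656):
        raise ValueError("memory must be 0.9-6.5 GB per vCPU")
    return f"db-custom-{vcpus}-{memory_mb}"

print(custom_tier(4, 15360))  # db-custom-4-15360
```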
4.6 HA, replicas, PITR, backups, maintenance
- High Availability (HA), regional configuration with synchronous standby in a different zone. Failover RTO ~60 seconds, RPO 0. ~2x cost.
- Read replicas, async, in same or different region. For read scaling and DR. Can be promoted to standalone.
- Point-in-time recovery (PITR), requires write-ahead logs enabled, can restore to any second within the retention window (default 7 days, up to 35).
- Automated backups, daily, configurable window. Multi-regional location.
- Maintenance window, weekly, minor version updates. Set to off-hours on Saturday morning.
4.7 Cloud SQL Auth Proxy and IAM auth
The Auth Proxy is a small binary (Go) that establishes a TLS-encrypted tunnel from your application to Cloud SQL using your Google credentials, no password needed for the client side and no firewall rules to manage. IAM authentication lets a Google identity log into Postgres/MySQL with a short-lived token.
📝 Code, run the Cloud SQL Auth Proxy on nexus-vm
# Download (Linux amd64)
curl -o cloud-sql-proxy \
https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.linux.amd64
chmod +x cloud-sql-proxy
# Connect to a Postgres instance via Unix socket
./cloud-sql-proxy PROJECT_ID:us-central1:bsp-pg --unix-socket=/cloudsql
# In your app, connect to host=/cloudsql/PROJECT_ID:us-central1:bsp-pg
4.8 PD vs Filestore vs GCS, when to use which
| Need | Pick | Why |
|---|---|---|
| Boot disk for nexus-vm | pd-balanced | Default, sized to needed IOPS |
| Database files | pd-ssd or hyperdisk-balanced | Predictable latency |
| Shared files across multiple VMs | Filestore (NFS) | POSIX semantics |
| Daily backups, snapshots, large objects | GCS | Cheap, durable, lifecycle |
| Logs, ML training data, public assets | GCS | Throughput scales horizontally |
| Scratch / cache | Local SSD | Lowest latency, ephemeral |
4.9 Filestore tiers
| Tier | Min capacity | Throughput | Use |
|---|---|---|---|
| Basic HDD | 1 TB | 100 MB/s/TB | Sequential, low cost |
| Basic SSD | 2.5 TB | 1.2 GB/s | Mixed workloads |
| Zonal (Enterprise) | 1 TB | scales linearly | SLA-backed, single zone |
| Regional | 1 TB | scales linearly | HA across zones |
| Enterprise (legacy) | 1 TB | scales linearly | Replaced by Regional |
4.10 Backup and DR strategies for nexus-vm
- Daily PD snapshots on the boot disk (Section 15.7).
- Weekly export of critical data to GCS (regional bucket with versioning).
- Monthly transition to Nearline/Coldline via lifecycle.
- Quarterly DR drill: rebuild a VM from the latest snapshot in a different zone.
✅ Production checklist, Storage
- UBLA on every bucket, no fine-grained ACLs in production
- Versioning + lifecycle on backup buckets
- Bucket Lock on compliance buckets
- Daily snapshot policy on the nexus-vm boot disk
- Weekly export of /opt/nexus to GCS regional bucket
- Cloud SQL HA enabled in production, automated backups, PITR retention >=14 days
- Cloud SQL Auth Proxy used in place of public-IP + password auth
- Signed URLs default expiry <=15 minutes for sensitive content
- CORS rules limited to known origins, not *
- Quarterly DR drill from snapshot to a fresh VM
🎓 FOR NEW HIRE, Storage in one paragraph
GCS holds anything that is not actively being read by a database (assets, backups, logs, ML data). Disks live attached to a VM and hold the OS and database. Cloud SQL is a managed MySQL/Postgres/SQL Server. Pick GCS by default, pick a Disk only when you need POSIX file semantics on a single VM, pick Cloud SQL when you need ACID transactions and don't want to operate Postgres yourself. The Python SDK for GCS is google-cloud-storage, install with pip install google-cloud-storage and the docs are at cloud.google.com/python/docs/reference/storage.
🔒 5. Secret Manager, Cloud KMS MEDIUM PRIORITY
Secret Manager holds the application secrets that nexus-vm needs (Anthropic API key, Cloudflare API token, BRICKS_WP_APP_PASSWORD, Vapi API key). Cloud KMS holds the encryption keys that protect everything else. Different tools, different jobs, often confused.
5.1 Secret Manager: model and versioning
A secret is a named container. Each secret has multiple versions (1, 2, 3...), only one of which is the latest at any time. Versions are immutable. To rotate a secret, add a new version, point your app at latest or pin to a specific version. Each access is auditable.
📝 Code, the standard Secret Manager workflow
# Create a secret
gcloud secrets create BRICKS_WP_APP_PASSWORD \
--replication-policy=automatic
# Add a version (rotation)
echo -n "new_app_password_value" | \
gcloud secrets versions add BRICKS_WP_APP_PASSWORD --data-file=-
# Read the latest version (from inside nexus-vm)
gcloud secrets versions access latest --secret=BRICKS_WP_APP_PASSWORD
# Disable an old version (preferred during rollout)
gcloud secrets versions disable 3 --secret=BRICKS_WP_APP_PASSWORD
# Destroy after the rollout is verified (irreversible)
gcloud secrets versions destroy 3 --secret=BRICKS_WP_APP_PASSWORD
5.2 Replication and CMEK
- Automatic, Google replicates the secret across multiple regions in your jurisdiction. Default. Highest availability.
- User-managed, you pick the regions, useful for data residency.
- CMEK, encrypt the secret at rest with your own KMS key, required for some compliance regimes.
5.3 Rotation and notifications
Secret Manager has built-in rotation scheduling. You set a --next-rotation-time and --rotation-period, and Secret Manager publishes a Pub/Sub message at the scheduled time. Your rotation handler (Cloud Function, Cloud Run job, etc.) creates a new version. Source: cloud.google.com/secret-manager/docs/rotation-recommendations.
5.4 Cloud KMS: model
Hierarchy: key ring → key → key version. Key rings are regional (or global, or multi-regional). A key has a purpose (symmetric encryption, asymmetric signing, asymmetric decryption, MAC), an algorithm, and a protection level (software or HSM). Each version is the actual cryptographic material. KMS never returns the raw key, you call encrypt, decrypt, sign, or verify.
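Every KMS call addresses a key version by its full resource name, which encodes the whole hierarchy. A trivial helper that makes the nesting explicit (project and key names here are placeholders):

📝 Code, KMS resource names mirror the hierarchy

```python
def kms_key_version(project: str, location: str, ring: str,
                    key: str, version: int) -> str:
    """Build the full resource name that encrypt/decrypt/sign calls expect.
    Mirrors the hierarchy: key ring -> key -> key version."""
    return (f"projects/{project}/locations/{location}"
            f"/keyRings/{ring}/cryptoKeys/{key}/cryptoKeyVersions/{version}")

print(kms_key_version("bsp-prod", "us-central1", "bsp-prod", "gcs-cmek-1", 1))
```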
5.5 CMEK on Compute, GCS, Cloud SQL
Customer-Managed Encryption Keys override the default Google-managed encryption with a key in your KMS. Apply at resource create time:
📝 Code, create a GCS bucket with CMEK
gcloud storage buckets create gs://bsp-cmek-test \
--default-encryption-key=projects/PROJECT_ID/locations/us-central1/keyRings/bsp-prod/cryptoKeys/gcs-cmek-1
# Grant GCS service agent permission to use the key
gcloud kms keys add-iam-policy-binding gcs-cmek-1 \
--keyring=bsp-prod --location=us-central1 \
--member=serviceAccount:service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
--role=roles/cloudkms.cryptoKeyEncrypterDecrypter
5.6 CSEK (legacy) and external HSM
Customer-Supplied Encryption Keys let you provide raw bytes per request. Mostly deprecated in favor of CMEK. External HSM via Cloud HSM (managed) or Cloud External Key Manager (your HSM at a partner like Equinix). Skip unless mandated by compliance.
5.7 Secret Manager into nexus-vm
📝 Code, fetch a secret from Python on nexus-vm
from google.cloud import secretmanager
def get_secret(name: str) -> str:
client = secretmanager.SecretManagerServiceClient()
project = "bsp-prod"
path = f"projects/{project}/secrets/{name}/versions/latest"
response = client.access_secret_version(request={"name": path})
return response.payload.data.decode("utf-8")
bricks_pwd = get_secret("BRICKS_WP_APP_PASSWORD")
cf_token = get_secret("CLOUDFLARE_API_TOKEN")
⚠️ Gotcha, never log secret values
Set up a logging filter that drops any field containing BRICKS_WP_APP_PASSWORD, CLOUDFLARE_API_TOKEN, ANTHROPIC_API_KEY. The fastest way to leak a secret is to print(get_secret(...)) during a debug session and forget. Secret Manager itself only audits "this secret was accessed", which still leaves you to track down where it went.
✅ Production checklist, Secrets and KMS
- Every API key, password, and credential lives in Secret Manager
- Rotation period set on every secret (90 days max for API keys)
- nexus-vm SA has roles/secretmanager.secretAccessor on each secret it needs, scoped not project-wide
- CMEK on the production GCS backup bucket and Cloud SQL instance
- Key rings regional (us-central1) matching the workload
- Audit logs (Data Access) on for Secret Manager and KMS
- Pub/Sub rotation handler tested at least once per quarter
🎓 FOR NEW HIRE, Secret Manager rule
If a value would let someone act as you, it goes in Secret Manager. Period. No .env files committed to git, no values pasted in Slack, no config.py with constants. The nexus-vm service account gets read access at runtime and Secret Manager logs every access for the audit trail.
📊 6. Observability, Logging, Monitoring HIGH PRIORITY
If nexus-vm goes sideways, observability is how you find out before Robert does. Cloud Logging, Cloud Monitoring, Cloud Trace, Profiler, and Error Reporting are five products under the umbrella name "Cloud Operations Suite" (formerly Stackdriver).
6.1 Cloud Logging architecture
Every log entry is a structured JSON document with a timestamp, a log name, a severity, a resource label set, and a payload. Entries flow into buckets (storage with retention), filtered by sinks (which entries go to which bucket or destination). The _Default bucket holds 30 days, the _Required bucket holds Admin Activity audit logs for 400 days, both free.
| Concept | What it is |
|---|---|
| Log entry | One row, structured fields, payload |
| Log name | Logical stream, e.g. cloudaudit.googleapis.com%2Factivity |
| Bucket | Storage location with retention policy |
| Sink | Filter expression + destination (bucket, BigQuery, GCS, Pub/Sub) |
| View | Restricts who sees which entries inside a bucket |
| Scope | Cross-project log scope for a single Logs Explorer |
6.2 Logs Explorer query language
📝 Code, common Logs Explorer queries
# All warnings/errors from nexus-vm in the last hour
resource.type="gce_instance"
resource.labels.instance_id="123456789"
severity>=WARNING
timestamp>="2026-04-28T18:00:00Z"
# IAM policy changes, last 7 days
logName=~"cloudaudit.googleapis.com%2Factivity"
protoPayload.serviceName="iam.googleapis.com"
protoPayload.methodName=~"SetIamPolicy"
# Failed SSH attempts
resource.type="gce_instance"
jsonPayload.message=~"Failed password for"
6.3 Retention, log-based metrics, alerts
Log-based metrics convert a log filter into a counter or distribution. Useful for "alert me when error rate > 1/min."
📝 Code, create a log-based metric for nexus-vm 5xx
gcloud logging metrics create nexus_5xx \
--description="5xx responses on nexus-vm" \
--log-filter='resource.type="gce_instance" AND jsonPayload.status>=500'
6.4 Cloud Monitoring
Workspaces, dashboards, MQL (Monitoring Query Language) for advanced queries, alerts with conditions and notification channels (email, Slack, PagerDuty, webhook). Default integrations exist for every GCE metric (CPU, disk, network, instance/up).
6.5 Uptime checks and SLOs
Uptime checks ping a public URL from 6 global locations every minute. Failures trigger alerts. Configure in Monitoring → Uptime checks. Add an HTTPS check for https://morpheus.callbrightside.com.
📝 Code, define an uptime check
gcloud monitoring uptime create morpheus-https \
--resource-type=uptime-url \
--resource-labels="host=morpheus.callbrightside.com,project_id=PROJECT_ID" \
--protocol=https --request-method=GET --path=/ --port=443 \
--period=1 --timeout=10  # period in minutes, timeout in seconds
6.6 Cloud Trace, Profiler, Error Reporting
- Cloud Trace, distributed tracing, OpenTelemetry-compatible. Auto-trace HTTP via OT instrumentation.
- Cloud Profiler, sample-based CPU and heap profiler with low overhead. Add the agent to a long-running process and view flamegraphs in the console.
- Error Reporting, groups identical errors across services, dedupes, alerts when a new fingerprint appears.
6.7 Ops Agent vs deprecated agents
The Ops Agent (single binary, Linux + Windows) replaces the legacy Stackdriver Logging Agent and Monitoring Agent. Configures via /etc/google-cloud-ops-agent/config.yaml.
📝 Code, ship Python framework logs from /opt/nexus to Cloud Logging
# /etc/google-cloud-ops-agent/config.yaml
logging:
receivers:
nexus_app:
type: files
include_paths:
- /opt/nexus/logs/*.log
- /opt/nexus/nexus/scripts/output/*.log
record_log_file_path: true
processors:
json_parse:
type: parse_json
service:
pipelines:
nexus_pipeline:
receivers: [nexus_app]
processors: [json_parse]
metrics:
receivers:
hostmetrics:
type: hostmetrics
collection_interval: 60s
service:
pipelines:
default_pipeline:
receivers: [hostmetrics]
# Apply
sudo systemctl restart google-cloud-ops-agent
🔥 Recency, legacy agents EOL
The legacy Logging Agent and Monitoring Agent were deprecated in 2023 and reach end of support October 2024 / Q1 2025. New installs must use Ops Agent. If you find google-fluentd or stackdriver-agent on the VM, that is an upgrade you owe yourself. Source: cloud.google.com/stackdriver/docs/deprecations.
6.8 Audit logs (cross-reference)
See Section 2.9. Admin Activity is always on, free, 400-day retention.
6.9 Common alerting patterns
- VM down, alert on compute.googleapis.com/instance/up = 0 for 3 minutes.
- Disk usage, alert on agent.googleapis.com/disk/percent_used > 85% for 5 minutes.
- Memory pressure, alert on agent.googleapis.com/memory/percent_used > 90% for 5 minutes.
- Auth failures, log-based metric on Failed password, alert when rate > 5/min.
- Cost spike, billing budget at 50%/75%/100% with email + Pub/Sub.
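The "rate > 5/min" pattern is just a sliding-window counter. Cloud Monitoring computes this server-side from the log-based metric; this toy Python version exists only to show the logic.

📝 Code, toy sliding-window rate alert

```python
from collections import deque

class RateAlert:
    """Toy model of the 'auth failures > threshold per window' pattern."""
    def __init__(self, threshold: int, window_s: int = 60):
        self.threshold, self.window_s = threshold, window_s
        self.events: deque[float] = deque()

    def record(self, ts: float) -> bool:
        """Record one event; True means the window now breaches threshold."""
        self.events.append(ts)
        while self.events and self.events[0] <= ts - self.window_s:
            self.events.popleft()  # drop events older than the window
        return len(self.events) > self.threshold

alert = RateAlert(threshold=5)
fired = [alert.record(t) for t in [0, 5, 10, 15, 20, 25]]
print(fired)  # the sixth failure inside 60 seconds trips the alert
```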
6.10 Observability cost
Logging: $0.50/GiB ingest, free egress to bucket, then storage $0.01/GiB-mo after 30 days for default bucket. Monitoring: free for resource metrics, $0.2580/MiB for chargeable metrics. Trace: $0.20 per million spans. Profiler: free.
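A quick calculator for the logging line items, using the rates above. These are list prices and the assumptions are simplified (flat ingest, linear storage beyond 30 days); verify against the current pricing page before budgeting.

📝 Code, rough monthly Cloud Logging bill

```python
def monthly_logging_cost(gib_per_day: float, retention_days: int = 30,
                         ingest_rate: float = 0.50,
                         storage_rate: float = 0.01) -> float:
    """Rough monthly bill: $0.50/GiB ingest, plus $0.01/GiB-month storage
    for retention beyond the default 30 days."""
    monthly_gib = gib_per_day * 30
    ingest = monthly_gib * ingest_rate
    extra_months = max(retention_days - 30, 0) / 30
    storage = monthly_gib * storage_rate * extra_months
    return round(ingest + storage, 2)

print(monthly_logging_cost(2))                      # ingest only, default retention
print(monthly_logging_cost(2, retention_days=365))  # plus long-retention storage
```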
6.11 Shipping /opt/nexus logs to Cloud Logging
Section 6.7's config is the right answer. Two patterns to know:
- Structured JSON logs are auto-parsed when the file extension is .log and the line is valid JSON. The fields become labels.
- Python's standard logging can write directly via google-cloud-logging's handler, no Ops Agent needed for that path.
📝 Code, Python direct logging into Cloud Logging
import logging
from google.cloud import logging as cloud_logging
cloud_logging.Client().setup_logging(log_level=logging.INFO)
logger = logging.getLogger("nexus.runner")
logger.info("job_started", extra={"json_fields": {"job_id": "abc"}})
✅ Production checklist, Observability
- Ops Agent installed and running on nexus-vm
- /opt/nexus/logs ingested into Cloud Logging via Ops Agent files receiver
- Structured JSON used for all application log lines
- Log-based metrics for: error rate, auth failures, slow queries
- Alert policies for: VM down, disk > 85%, mem > 90%, error spike
- Notification channels: Robert email + Slack #ops
- Uptime check on morpheus.callbrightside.com from 6 regions
- Log retention: _Default 30d, custom audit bucket 365d
- Sink to GCS for compliance archive (Coldline, lifecycle)
- Quarterly review of unused dashboards and noisy alerts
🎓 FOR NEW HIRE, Observability mental model
Logs answer "what happened." Metrics answer "is it normal." Traces answer "where did the request spend time." Errors answer "what's broken." Start with logs and metrics; learn traces when you need them. The first place to look during an incident is Logs Explorer, filter to severity>=ERROR and the time window you care about. The Logs Explorer query language is documented at cloud.google.com/logging/docs/view/logging-query-language.
📝 7. APIs, Auth, SDKs MEDIUM PRIORITY
Every Google Cloud product is an HTTP API. Every API call passes through three checkpoints: API enablement (is the service turned on for this project), authentication (who is calling), and authorization (Section 2 IAM). Understanding the layers makes 401/403 errors trivial.
7.1 APIs catalog and enablement
Each API is identified by a service name like compute.googleapis.com, storage.googleapis.com, secretmanager.googleapis.com, aiplatform.googleapis.com. APIs must be explicitly enabled per project before any call works.
📝 Code, manage API enablement
# List enabled APIs
gcloud services list --enabled
# Enable an API
gcloud services enable secretmanager.googleapis.com
# Disable (will fail if resources exist)
gcloud services disable bigtable.googleapis.com
7.2 OAuth 2.0 flows
| Flow | Use |
|---|---|
| Authorization Code | Web apps acting on behalf of a user |
| Authorization Code + PKCE | Native and mobile apps |
| Client Credentials (JWT bearer) | Service accounts |
| Implicit (legacy) | Avoid |
| Device flow | TVs, CLI on a remote box |
7.3 Application Default Credentials (ADC)
The discovery order Google client libraries follow when looking for credentials:
1. GOOGLE_APPLICATION_CREDENTIALS env var pointing at a JSON key file
2. gcloud user credentials in ~/.config/gcloud/application_default_credentials.json
3. Attached service account on a GCE VM, Cloud Run, Cloud Functions, App Engine, GKE Workload Identity
4. External account (WIF) configured via gcloud iam workload-identity-pools create-cred-config
Use gcloud auth application-default login for local development. On nexus-vm, no env var is needed; the attached SA is auto-discovered via the metadata service.
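A simplified sketch of that discovery order, for intuition only. The real lookup lives inside google.auth.default() and handles more cases (external accounts, impersonation); this just shows why the env var always wins.

📝 Code, simplified ADC discovery order

```python
import os

def adc_source() -> str:
    """Mimic the ADC discovery order, simplified for intuition:
    env var, then the gcloud well-known file, then the metadata server."""
    key_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file:
        return f"service account key file: {key_file}"
    well_known = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json")
    if os.path.exists(well_known):
        return "gcloud user credentials"
    # On GCE / Cloud Run / GKE the metadata server answers at this step.
    return "attached service account via metadata server"

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/tmp/key.json"
print(adc_source())  # the env var wins over everything else
```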
7.4 gcloud config and named profiles
📝 Code, multi-account gcloud profiles
gcloud auth login # browser flow
gcloud config configurations create bsp-prod
gcloud config set project bsp-prod
gcloud config set compute/zone us-central1-a
gcloud config set account robert.dove@callbrightside.com
gcloud config configurations list
gcloud config configurations activate bsp-prod
7.5 Cloud Shell vs local development
Cloud Shell is a free Linux VM in your browser pre-loaded with gcloud, kubectl, terraform, docker, python, node. Persists 5 GB of $HOME. Sessions auto-expire after 60 minutes idle. Useful when your local machine doesn't have gcloud, or when you want to test as a different identity without polluting local config.
7.6 Python SDK clients
| Library | Install | For |
|---|---|---|
| google-cloud-storage | pip install google-cloud-storage | GCS |
| google-cloud-secret-manager | pip install google-cloud-secret-manager | Secret Manager |
| google-cloud-compute | pip install google-cloud-compute | Compute Engine API |
| google-cloud-logging | pip install google-cloud-logging | Cloud Logging |
| google-cloud-monitoring | pip install google-cloud-monitoring | Cloud Monitoring |
| google-cloud-pubsub | pip install google-cloud-pubsub | Pub/Sub |
| google-cloud-aiplatform | pip install google-cloud-aiplatform | Vertex AI |
| google-cloud-bigquery | pip install google-cloud-bigquery | BigQuery |
7.7 REST vs gRPC
Google client libraries default to gRPC where supported (faster, streaming, smaller wire). REST is the fallback (firewall friendlier, easier to debug with curl). Most products support both; the Python SDK abstracts the choice. Compute Engine, Cloud SQL Admin, and a handful of older APIs are REST-only.
7.8 API versioning
Versions follow v1, v1beta1, v1alpha1. Beta is supported for production but breaking changes may occur. Alpha is allowlist-only. Pin to a specific version in your client library imports to avoid surprises.
7.9 Quotas and rate limiting
Every API has per-minute and per-day quotas. View at IAM & Admin → Quotas & System Limits. Common surprises: Compute Engine API Persistent disks (GB) regional quota, Cloud Logging Write API requests per minute, Cloud Functions Concurrent function executions. Request increases via the console; turnaround is hours-to-days.
7.10 Backoff and retry patterns
📝 Code, exponential backoff with the Google client library
from google.api_core import retry
from google.cloud import storage

# Retry transient failures: 1s initial wait, doubling to a 30s cap,
# giving up after 300s overall. (timeout supersedes the older,
# deprecated deadline argument.)
custom_retry = retry.Retry(initial=1.0, multiplier=2.0, maximum=30.0, timeout=300.0)

client = storage.Client()
bucket = client.bucket("bsp-backups")
blob = bucket.blob("daily.tar.gz")
blob.upload_from_filename("daily.tar.gz", retry=custom_retry)
⚠️ Gotcha, idempotency and retries
Auto-retry only works safely on idempotent operations (GET, PUT with full payload, DELETE). For non-idempotent POSTs (create instance), wrap in requestId to dedupe. Compute Engine accepts a requestId on most insert operations.
✅ Production checklist, APIs
- Only the APIs you actually use are enabled (audit quarterly)
- nexus-vm runs as a dedicated SA, not the default Compute SA
- ADC discovery used everywhere, no hardcoded key paths in code
- gcloud profiles separate prod from sandbox
- Pinned client library versions in requirements.txt with hashes
- Exponential backoff for any rate-limited API
- Quota dashboards monitored, alerts at 80% utilization
🎓 FOR NEW HIRE, the Python + GCP starter kit
You will live mostly in Python. The "official" Google client libraries follow a consistent shape: Client object → resource methods. Read cloud.google.com/python/docs/reference as your bookmark page. Bash and gcloud are the second language for ops scripting. Go and TypeScript surface in two contexts only: Cloud Functions / Cloud Run if we go serverless (TS or Python or Go), and the Cloud SQL Auth Proxy / Ops Agent (Go internals). You can ship effective work for years on Python + Bash + a little gcloud.
🚀 8. Build, Deploy, IaC MEDIUM PRIORITY
How code gets from a git push to running on production. The BSP nexus-vm stack today is updated by SSH + git pull + systemd restart. The mature future is Cloud Build → Artifact Registry → deploy.
8.1 Cloud Build
Hosted CI. Each build runs in a sandbox using a sequence of Docker steps defined in cloudbuild.yaml. Triggers on git push (GitHub, GitLab, Bitbucket, Cloud Source Repositories) or webhook. Outputs land in Artifact Registry, GCS, or anywhere the build calls.
📝 Code, minimal cloudbuild.yaml for nexus-vm Python sync
# cloudbuild.yaml
steps:
  - name: "python:3.11-slim"
    entrypoint: bash
    args:
      - -c
      - "pip install -r requirements.txt && python -m pytest -q"
  - name: "gcr.io/cloud-builders/gcloud"
    args: ["compute", "ssh", "nexus-vm", "--zone=us-central1-a",
           "--command=cd /opt/nexus && git pull && sudo systemctl restart nexus.service"]
options:
  logging: CLOUD_LOGGING_ONLY
timeout: "600s"
8.2 Artifact Registry
The successor to Container Registry (gcr.io). Holds container images, plus Maven, npm, Python (PyPI-style), Apt, Yum, Go module, generic file repos. Regional or multi-regional. Per-repo IAM. Vulnerability scanning available (Container Analysis API). Source: cloud.google.com/artifact-registry/docs.
🔥 Recency, Container Registry sunset
Container Registry (gcr.io) was deprecated in 2023 and is being shut down. New images go to Artifact Registry. Existing gcr.io/PROJECT/image URLs auto-redirect via the pkg.dev Artifact Registry mirror, but you should migrate explicitly. Run the migration tool: gcloud artifacts docker upgrade migrate.
8.3 Cloud Deploy
Managed delivery pipeline (continuous delivery). Stages a release through a chain of environments (dev → staging → prod) with manual or automatic promotion gates. Native targets are GKE, Cloud Run, and recently GCE MIG. For a single VM, the value is lower; we lean on Cloud Build directly today.
8.4 IaC choices: Terraform, Deployment Manager, gcloud, Config Connector
| Tool | Status | Use |
|---|---|---|
| Terraform | Industry standard, recommended | Multi-cloud, broad community |
| Deployment Manager | Deprecated 2024, EOL | Legacy projects only, migrate |
| gcloud / scripts | Active | One-offs, ad hoc |
| Config Connector | Active | Manage GCP from inside K8s, GitOps |
| Pulumi | Active third-party | If you prefer real code over HCL |
📝 Code, minimal Terraform for nexus-vm
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "bsp-tfstate"
    prefix = "nexus/prod"
  }
}

provider "google" {
  project = "bsp-prod"
  region  = "us-central1"
}

resource "google_compute_instance" "nexus_vm" {
  name         = "nexus-vm"
  machine_type = "n2-standard-2"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 50
    }
  }

  network_interface {
    network    = google_compute_network.bsp_prod_vpc.id
    subnetwork = google_compute_subnetwork.nexus_subnet.id
    access_config { nat_ip = google_compute_address.nexus_static.address }
  }

  service_account {
    email  = google_service_account.nexus_runner.email
    scopes = ["cloud-platform"]
  }

  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  tags = ["web", "ssh-iap"]
}
8.5 Cloud Source Repositories
Google's git hosting. Free for up to 5 users and 50 GB, but closed to new customers since mid-2024. Mirrors GitHub repos for inside-VPC pulls; historically used as a Cloud Build source mirror. BSP source lives on GitHub, so CSR is optional and not worth adopting now.
8.6 GitHub Actions / GitLab CI integration via WIF
See Section 2.4. The keyless path replaces uploading a service account JSON key to GitHub. The official action is google-github-actions/auth@v2.
📝 Code, GitHub Actions step using WIF
# .github/workflows/deploy.yml
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: "projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/providers/gh-provider"
    service_account: "ci-deployer@bsp-prod.iam.gserviceaccount.com"
- run: gcloud compute ssh nexus-vm --zone=us-central1-a --command="cd /opt/nexus && git pull && sudo systemctl restart nexus"
8.7 Terraform state on GCS
Use a dedicated GCS bucket as Terraform's remote backend. Enable object versioning on the bucket; the GCS backend provides state locking natively via a lock file. Restrict the bucket IAM to the CI service account and the humans who need to import state by hand.
8.8 Deployment patterns for nexus-vm
- Today, manual SSH + git pull + systemctl restart. Works for a single operator.
- Step up, Cloud Build trigger on push to main runs tests, then SSHes in to deploy. Adds an audit trail.
- Mature, build a custom image weekly, swap a MIG of size 1, drain the old VM. Adds rollback.
8.9 Rollbacks
Without IaC, rollback is "git revert + redeploy." With snapshots and instance templates, rollback is "revert MIG to template version v(N-1)" which can complete in minutes.
✅ Production checklist, Build & Deploy
- All infra in Terraform with state on GCS, versioning on, locked
- Cloud Build trigger on main, runs tests before deploy
- CI uses Workload Identity Federation, no JSON keys
- Artifact Registry with vulnerability scanning enabled
- No images on legacy gcr.io (run upgrade tool)
- Build logs streamed to Cloud Logging, retained 90 days
- Deploy is reversible inside 5 minutes (snapshot or git revert)
- Rollback drill quarterly
🎓 FOR NEW HIRE, deploy paths
Today: SSH in, git pull in /opt/nexus, sudo systemctl restart nexus.service. Always look at journalctl -u nexus.service -n 100 -f after a deploy. Tomorrow: a GitHub Actions workflow does it for you when a PR merges. Either way, the pattern of "pull + restart + watch logs" is the same.
⚡ 9. Serverless LOWER PRIORITY
Serverless on GCP means "you bring code, Google runs it on demand." Lower priority for our single-VM stack today, but the right answer for many things we currently shoehorn into nexus-vm cron.
9.1 Cloud Run
The flagship. Containers (any language, any base image), HTTP and gRPC, scaling from 0 to N. Pay per 100 ms of CPU + memory while serving. Two flavors: services (long-running HTTP) and jobs (run to completion). Concurrency per container is configurable, 1 to 1000, default 80.
| Feature | Detail |
|---|---|
| Cold start | ~100ms-2s depending on image size |
| Request timeout | up to 60 minutes (services), 24 hours (jobs) |
| Memory | 128 MiB to 32 GiB |
| CPU | 1, 2, 4, 8 vCPU |
| Min instances | 0 default; set > 0 to avoid cold starts at a cost |
| VPC connector | Direct VPC egress (now GA), or Serverless VPC Access connector |
| Auth | Public, IAP-protected, or invoker IAM |
📝 Code, deploy a Python Cloud Run service from source
gcloud run deploy nexus-helper \
--source=. --region=us-central1 \
--allow-unauthenticated --memory=512Mi --cpu=1 \
--service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
--set-secrets=ANTHROPIC_API_KEY=anthropic-key:latest
9.2 Cloud Functions Gen 1 vs Gen 2
Gen 2 is Cloud Run under the hood with a function-shaped interface, rebranded "Cloud Run functions" in 2024. Gen 1 is legacy. New code goes to Gen 2 (longer timeouts, larger instances, concurrency > 1, richer event triggers). Gen 1 is in maintenance mode.
9.3 App Engine Standard vs Flex
- Standard, language sandbox (Python, Node, Go, Java, PHP, Ruby), instant scale, but legacy programming model.
- Flexible, your container on GCE behind a managed LB. Largely superseded by Cloud Run.
For new builds, default to Cloud Run unless you have an existing App Engine app.
9.4 Pub/Sub
Globally available message queue. Decouples producers and consumers. Two delivery modes: push (Pub/Sub POSTs to your endpoint) and pull (your worker fetches). Delivery is at-least-once by default; exactly-once delivery is a separate setting available on pull subscriptions. Ordering keys, message filtering, schema validation, and dead letter topics round out the feature set. Source: cloud.google.com/pubsub/docs.
9.5 Cloud Scheduler, Tasks, Workflows
| Service | Use |
|---|---|
| Cloud Scheduler | Cron-as-a-service. Hits HTTP, Pub/Sub, App Engine. |
| Cloud Tasks | Per-item queue with rate limiting, dispatch retry, delay. |
| Workflows | YAML state machine for multi-step orchestration. |
9.6 Eventarc
Event router that turns audit log events, GCS object writes, Pub/Sub messages, BigQuery jobs, and SaaS webhooks into Cloud Run / GKE invocations. Use for "when an object lands in this bucket, run that handler" without writing glue.
9.7 Cost models
- Cloud Run: $0.0000180/vCPU-s + $0.000002/GiB-s + $0.40/M requests. First 180k vCPU-s and 360k GiB-s/month free.
- Cloud Functions Gen 2: same as Cloud Run.
- Pub/Sub: $40/TiB ingest, $40/TiB delivery. First 10 GiB/month free.
- Cloud Scheduler: 3 jobs free, then $0.10/job-month.
- Workflows: $0.01/1000 internal steps.
✅ Production checklist, Serverless
- Cloud Run services use a non-default SA with least privilege
- Sensitive Cloud Run services require auth (IAM invoker)
- Min instances > 0 only where cold start is unacceptable
- Pub/Sub topics have dead letter policies
- Cloud Scheduler jobs idempotent and have a max retry policy
- Workflows replace shell scripts when there are >3 steps with branching
🎓 FOR NEW HIRE, when to reach for serverless
Anything that runs on a schedule and finishes in under 10 minutes is a great Cloud Scheduler + Cloud Run job candidate. Anything that responds to events (an upload arrived, a Pub/Sub message landed) is Cloud Run or Functions. Anything that runs continuously and holds state is the VM. We default to "put it on the VM" today, but if you find yourself reaching for cron, ask yourself if Cloud Scheduler is nicer.
🏢 10. Project, Org, Billing MEDIUM PRIORITY
Projects are the unit of cost, quota, and IAM. Organizations are the unit of governance. Billing accounts pay the bills. The wiring matters more than people realize.
10.1 Resource hierarchy
Org → Folder (optional, can nest) → Project → Resources. Most BSP work happens in one project (bsp-prod). Recommended additions: bsp-sandbox for safe experiments, bsp-data for BigQuery and analytics with separate billing visibility.
10.2 Organization policies
Constraints applied at Org/Folder/Project level that override IAM. Examples:
- compute.requireOsLogin, force OS Login on every VM
- compute.disableSerialPortAccess, no serial console
- iam.disableServiceAccountKeyCreation, no SA key downloads
- storage.publicAccessPrevention, no allUsers reads
- storage.uniformBucketLevelAccess, force UBLA on new buckets
- compute.vmExternalIpAccess, allowlist external IPs
- iam.allowedPolicyMemberDomains, restrict who can be granted IAM
📝 Code, set an org policy
gcloud resource-manager org-policies set-policy policy.yaml \
--organization=ORG_ID
# policy.yaml
constraint: constraints/iam.disableServiceAccountKeyCreation
booleanPolicy: { enforced: true }
10.3 Custom org policy constraints
2023+ feature. Define your own constraint in CEL targeting any GCP resource field. Example: "every GCS bucket must be in us-central1." Source: cloud.google.com/resource-manager/docs/organization-policy/custom-constraints.
10.4 Quotas
Per-project, per-region, per-API. Soft limits, increase via console request. Common single-VM ones to know:
- Compute Engine: CPUs (regional)
- Compute Engine: Persistent Disk SSD (GB) (regional)
- Compute Engine: In-use IP addresses (regional)
- Cloud Logging: Log entries per second
10.5 Billing accounts and BigQuery exports
Billing accounts pay one or many projects. Separate the production billing account from the sandbox so a runaway dev VM does not blow the prod budget. Enable BigQuery billing export for accurate per-resource cost analysis.
📝 Code, enable BigQuery billing export
# In the Console: Billing → Billing export → BigQuery export
# Creates a dataset like: PROJECT.billing_export
# Sample query: top 10 services by cost last 30 days
SELECT
service.description AS service,
ROUND(SUM(cost), 2) AS cost_usd
FROM `PROJECT.billing_export.gcp_billing_export_v1_BILLING_ID`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service
ORDER BY cost_usd DESC
LIMIT 10;
10.6 Budgets and alerts
Set per-billing-account or per-project budgets with alerts at thresholds (50%, 75%, 90%, 100%, 150% of forecast). Notifications go to email and Pub/Sub. Pub/Sub can trigger automatic remediation (e.g. shut down sandbox VMs).
10.7 Asset Inventory and Recommender
- Cloud Asset Inventory, queryable snapshot of every resource and IAM binding. Export to BigQuery for compliance reporting. Real-time feed via Pub/Sub.
- Recommender, ML-driven suggestions per resource: idle VMs, oversized VMs, idle PDs, IAM role recommendations.
- Active Assist, umbrella term for Recommender + Policy Intelligence + Insight feeds.
10.8 Project metadata: labels and tags
Two distinct concepts:
- Labels, key/value strings for billing breakdown and search. Set per-resource. Examples: env=prod, component=nexus.
- Tags, hierarchical key/value pairs that participate in IAM (deny policies, conditional bindings) and firewall rules. Set at Org or Project level.
✅ Production checklist, Project & Org
- Org policies enforced for: requireOsLogin, disableServiceAccountKeyCreation, publicAccessPrevention, uniformBucketLevelAccess
- Production isolated from sandbox in different projects with different billing accounts
- Budgets set on every billing account with email + Pub/Sub alerts
- BigQuery billing export running
- Asset Inventory feed to a SecOps bucket for compliance
- Quotas reviewed quarterly, requested increases preempt growth
- Standard label taxonomy applied to every resource (env, component, owner)
🎓 FOR NEW HIRE, project anatomy
If you ever feel "I want to try this without risk," ask Robert to point you at the sandbox project. Production isolation is real. The first thing in any console session: check the project picker top-left and confirm you are in the right project. The number of incidents caused by being in the wrong project is non-trivial.
🛡️ 11. Security Operations MEDIUM PRIORITY
Security Operations on GCP is the sum of detection (Security Command Center), guardrails (Org Policy, Binary Auth, VPC-SC), and forensics (Audit Logs, Asset Inventory).
11.1 Security Command Center tiers
| Tier | Cost | Capabilities |
|---|---|---|
| Standard | Free | Findings: Web Security Scanner, Sensitive Action Service, exposed assets |
| Premium | Per-vCPU + per-bucket pricing | + Event Threat Detection, Container Threat Detection, VM Threat Detection, Posture, Compliance reports (CIS, PCI, HIPAA) |
| Enterprise | Higher tier | + Mandiant threat intel, MISP integration, SOC features |
11.2 Threat detection
- Event Threat Detection, scans Cloud Logging for indicators (suspicious IAM, SSH brute force, malware DNS).
- Container Threat Detection, GKE-only, runtime detection.
- VM Threat Detection, agent-less host introspection on GCE VMs (memory scan).
11.3 Vulnerability scanning
Container Analysis API scans Artifact Registry images on push. Web Security Scanner runs against your App Engine / Cloud Run / GCE web apps to find OWASP issues. Free tier of SCC includes both at limited frequency.
11.4 Compliance reports
SCC Premium includes pre-built reports against frameworks: CIS GCP Benchmark v1.3, PCI DSS, HIPAA, NIST 800-53, ISO 27001, SOC 2, FedRAMP Moderate/High. Each report shows compliant vs non-compliant resources.
11.5 Binary Authorization and Container Analysis
Binary Auth gates GKE/Cloud Run/Anthos so that only signed, attested container images run. Pair with Container Analysis to require a "no high CVEs" attestation. Out of scope for single-VM nexus-vm but the pattern to know if we move to containers.
11.6 Cloud DLP (Sensitive Data Protection)
Detect and redact PII in text, images, and BigQuery. InfoTypes include US_SSN, EMAIL_ADDRESS, CREDIT_CARD_NUMBER, custom regex. Use during ingest of customer data into BSP analytics: scan a sample with DLP, decide whether the field is allowed.
11.7 Access Transparency and Approval
- Access Transparency, logs every Google support engineer access to your data with reason. Available on Premium support+.
- Access Approval, requires your explicit approval before a Google engineer can access your data, at the cost of slower support response.
11.8 Control-plane CMEK
Beyond data CMEK (Section 5.5), control-plane CMEK encrypts metadata about your resources (config, IAM bindings) with your key in some products. Niche, but compliance-relevant.
✅ Production checklist, Security Operations
- SCC Standard enabled on the org (free)
- Web Security Scanner runs against morpheus.callbrightside.com weekly
- Image vulnerability scanning on Artifact Registry repos
- Audit logs archived to a write-only bucket with Bucket Lock
- VPC-SC perimeter around production data buckets and Cloud SQL
- Findings triaged inside 7 days for high, 24 hours for critical
- Quarterly tabletop incident response exercise
🎓 FOR NEW HIRE, the security mindset
Default to least privilege. When you write code that needs a permission, grant the smallest predefined role you can find, or build a custom role. Treat every "could this credential leak" with paranoia. The cheapest way to get hacked is a leaked SA key in a public repo, the cheapest defense is Workload Identity Federation. Read the SCC findings tab once a week and learn what the org looks like to a defender.
💰 12. Cost HIGH PRIORITY
A nexus-vm-sized stack on GCP is cheap if you watch it, expensive if you don't. The expensive surprises are predictable; the cheap path is small habits.
12.1 Pricing models
- Pay as you go, default. Per-second compute, per-GB storage, per-GB egress.
- Sustained use discounts, automatic, kick in above 25% of the month for eligible families (N1, N2, C2; E2 has no SUD).
- Committed use discounts, 1- or 3-year, 37%-55% off list, opt-in.
- Spot VMs, 60-91% off list, can be reclaimed.
- Free tier, persistent monthly free amounts (e2-micro 1 in us-east1/us-west1/us-central1, 5 GB GCS, etc.).
12.2 Free tier
| Service | Free per month |
|---|---|
| Compute Engine | 1 e2-micro in us-central1/us-east1/us-west1, 30 GB pd-standard, 1 GB egress to most |
| GCS | 5 GB Standard, 5,000 Class A ops, 50,000 Class B ops, 1 GB egress |
| Cloud Run | 2M requests, 180,000 vCPU-s, 360,000 GiB-s |
| Cloud Functions | 2M invocations, 400k GB-s, 200k GHz-s |
| Cloud Logging | 50 GiB ingest, 30-day retention |
| Cloud Monitoring | All resource metrics + 150 MiB chargeable |
| Cloud Build | 120 build-minutes/day |
| Pub/Sub | 10 GiB |
| Secret Manager | 6 active secrets, 10k access ops |
12.3 Optimization strategies
- Right-size, use Recommender's "Right-size VMs" insight monthly.
- Lifecycle GCS, auto-tier to Nearline at 30 days, Coldline at 90, delete or archive at 365.
- Schedule sandbox shutdowns, Cloud Scheduler stops dev VMs nights and weekends.
- Buy CUDs, when steady state is locked in.
- Tune log ingest, exclude noisy log lines via sink filters before they hit storage.
- Avoid cross-region egress, keep workloads in the same region as their data.
12.4 Billing exports to BigQuery
See Section 10.5. The detailed export includes per-resource cost broken down by SKU.
12.5 Common cost surprises
| Surprise | Why | Mitigation |
|---|---|---|
| Egress charges | Cross-region or to-internet egress is $0.01-$0.12/GB | Keep data in-region, use Cloud CDN, compress responses |
| Cloud NAT data processing | $0.045/GB processed + $0.0045/hr per IP | Use Private Google Access for googleapis.com endpoints |
| Log ingest | $0.50/GiB beyond 50 GiB | Drop noisy log lines via Sink filter exclusions |
| Snapshot accumulation | Snapshots compound until you delete | Lifecycle on snapshot schedule (e.g. keep 7 daily, 4 weekly, 12 monthly) |
| Idle static IP | $0.005/hr while not attached to running VM | Release unused IPs |
| Cloud Logging rehydration | Fetching logs older than retention is expensive | Stream to GCS or BQ before retention cliff |
| Cloud SQL HA when not needed | 2x cost | Disable HA on dev/sandbox |
12.6 nexus-vm specific cost analysis
Rough monthly estimate, assuming n2-standard-2 (2 vCPU, 8 GB RAM) running 24/7 in us-central1, 50 GB pd-balanced, 1 static IP, ~5 GB Cloud Logging, ~50 GB GCS Standard for backups:
| Item | Detail | $/month |
|---|---|---|
| n2-standard-2 vCPU | 2 vCPU x 730 hrs x $0.0317 | ~$46.30 |
| n2-standard-2 RAM | 8 GiB x 730 hrs x $0.00425 | ~$24.83 |
| Sustained use discount | ~10% off N2 (auto) | -$7.10 |
| 50 GB pd-balanced boot disk | 50 x $0.10 | ~$5.00 |
| Static external IP (in use) | 730 hrs x $0.000 | ~$0.00 |
| Egress to internet | ~10 GB x $0.085 (assumes US-to-most) | ~$0.85 |
| Cloud Logging | 5 GB ingest, free under 50 GiB | $0.00 |
| GCS Standard backups | 50 GB x $0.020 | ~$1.00 |
| Snapshot storage | ~30 GB compressed x $0.026 | ~$0.78 |
| Secret Manager | ~10 secrets, 1k ops/mo | ~$0.06 |
| Total estimate | List minus SUD | ~$71.72 |
💡 Insight, where a 1-yr CUD pays back
A 1-year resource-based CUD on 2 vCPU + 8 GB RAM in us-central1 saves ~37% off N2 list. That is roughly $26/mo savings on a $70 base, paying back inside the first month and locking in the rate for 12 months. Caveat: you keep paying for the committed amount even if you delete the VM.
⚠️ Gotcha, egress is the most-asked-about line
If your monthly bill jumps by $50 unexpectedly, look at egress first. A misconfigured backup that pulls 600 GB to a non-Google destination is roughly $50 of egress out of nowhere. Run the BigQuery billing export query grouped by SKU and filter on sku.description LIKE '%egress%'.
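The arithmetic behind that example. The default rate is the US-to-internet figure used in the nexus-vm estimate; actual egress rates vary by destination in the $0.01-$0.12/GB range noted earlier:

```python
def egress_cost_usd(gb, rate_per_gb=0.085):
    """Quick egress sanity check: GB moved times the per-GB rate."""
    return round(gb * rate_per_gb, 2)
```

So the misconfigured 600 GB backup above comes out to about $51, which is exactly the kind of line that should jump out of the SKU-grouped billing query.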
✅ Production checklist, Cost
- Budgets per project with 50/75/100% alerts
- BigQuery billing export running, dashboards built
- 1-year CUD evaluated for steady-state nexus-vm
- GCS lifecycle on every long-lived bucket
- Snapshot retention policy (max 30 daily snapshots)
- Idle static IPs released monthly
- Logging exclusion filters for noisy services (e.g. health check 200s)
- Sandbox auto-stop schedule via Cloud Scheduler
- Quarterly Recommender review
🎓 FOR NEW HIRE, cost discipline in 90 seconds
Before you create anything, ask: how much does this cost per month if I forget to delete it? GCP bills by the second; a forgotten dev VM at $50/month is $1.65/day. The dashboard at Console → Billing → Reports tells you in <30 seconds. Bookmark it. Look at it weekly.
🖥️ 13. Console UI LOWER PRIORITY
The web console at console.cloud.google.com is mostly self-explanatory, but a handful of patterns save real time.
13.1 URL structure
Every page has a deep link. https://console.cloud.google.com/compute/instances?project=PROJECT_ID jumps directly to the instance list for a project. Bookmark these for the resources you visit daily.
13.2 Cloud Shell
Click the >_ icon top-right to open a Linux shell in your browser, no install. Pre-loaded with gcloud, kubectl, terraform, docker, python, node, vim. $HOME persists 5 GB. Sessions expire after 60 minutes idle.
13.3 Activity feed
Console → Home → Activity. A timeline of every Admin Activity audit event in the project. Useful first stop for "who changed what."
13.4 Search bar
Top-bar search auto-completes resource names across services. Search for nexus-vm and you get the GCE instance, related disks, snapshots, and any logs entries that mention it. Faster than navigating menus.
13.5 Dashboard customization
Console → Home → Dashboard. Add/remove tiles. Pin Monitoring dashboards. Useful for an at-a-glance ops view.
13.6 Mobile app
"Cloud Console" app for iOS/Android. Useful for: viewing alerts, restarting a VM in a pinch, checking the bill on a Sunday morning. Do not run major IaC changes from a phone.
🎓 FOR NEW HIRE, console productivity
Three habits: (1) confirm the project picker every time you open a new tab, (2) press / to jump to search, (3) star resources to pin them in the navigation drawer. The console is fine for exploration; for any change that needs an audit trail, prefer gcloud or Terraform so the change is reviewable.
🔍 14. Troubleshooting HIGH PRIORITY
When something is on fire, a runbook beats panic. This section is the runbook for the failure modes you will actually hit on nexus-vm.
14.1 Cloud Debugger deprecation
🔥 Recency, Cloud Debugger removed
Cloud Debugger was sunset in May 2023. Replacement: Cloud Profiler for performance, plus modern OpenTelemetry-based debugging in your IDE. If you find docs referencing Cloud Debugger, ignore them.
14.2 Connectivity Tests and network path analysis
Console → Network Intelligence → Connectivity Tests. Define a source (VM, IP, internet) and destination, run a simulated path. Tells you which firewall rule, route, or peering blocked the traffic. Saves hours of guessing.
📝 Code, run a connectivity test from CLI
gcloud network-management connectivity-tests create nexus-from-cf \
--source-ip-address=104.16.0.1 \
--destination-instance=projects/PROJECT_ID/zones/us-central1-a/instances/nexus-vm \
--destination-port=443 --protocol=TCP
gcloud network-management connectivity-tests describe nexus-from-cf
14.3 Troubleshooter wizards
The console has wizards for: IAM "why can't user X do Y", VPC "why can't VM A reach B", LB "why is health check failing". Run them before guessing.
14.4 Quotas page
IAM & Admin → Quotas & System Limits. When an API call returns 429, look here first. Filter to the service whose quota you suspect.
14.5 Error code reference
| Code | Meaning | First check |
|---|---|---|
| 400 | Bad request | Validate request body, region, zone names |
| 401 | Unauthenticated | ADC discovery, expired token, wrong gcloud config |
| 403 | Permission denied | Missing IAM role; check exact permission string in error |
| 404 | Not found | Resource name typo, wrong project, wrong region |
| 409 | Conflict | Concurrent modification, wait and retry |
| 429 | Too many requests | Quota or rate limit; check Quotas page |
| 500 | Internal error | Retry with backoff; check status.cloud.google.com |
| 503 | Service unavailable | Regional outage; check status page; retry |
14.6 IAM troubleshooter step-by-step
1. Copy the exact permission string from the 403 (e.g. compute.instances.start).
2. Open Console → IAM & Admin → Troubleshoot.
3. Enter the user/SA email and the resource (instance URL).
4. Click "Check access." It returns the inherited bindings and the missing permission.
5. Grant the smallest predefined role (look it up in cloud.google.com/iam/docs/understanding-roles) that contains the permission.
6. Re-run the failing call. If still 403, check Org Policy and Deny policies.
✅ Production checklist, Troubleshooting
- Connectivity Tests scripted for the top 5 failure paths
- Status page (status.cloud.google.com) bookmarked
- Runbook with "first 5 minutes" steps for: VM unreachable, 5xx spike, cost spike, auth failures
- On-call rotation acknowledged and tested for alert delivery
- Monthly tabletop drill on a different scenario each time
🎓 FOR NEW HIRE, the calmness algorithm
(1) Read the error literally. (2) Map it to a section in this doc. (3) Run the troubleshooter / connectivity test before guessing. (4) Never paste your fix into production until you can articulate the failure mode in one sentence. (5) When stuck after 30 minutes, ask Robert. Cost of asking: 0. Cost of cascading the wrong fix: hours.
🎯 15. Integration Points with nexus-vm Stack HIGHEST PRIORITY
The longest section by design. Everything above maps to abstractions; this maps to the actual production stack at 34.55.179.122 and the systems it touches. If a future Robert reads only one section to recover from a disaster, this is the one.
15.1 The current stack, in one screen
| Layer | Component | State today | Where it lives |
|---|---|---|---|
| DNS & Edge | Cloudflare zone a87220882ed631dd4dfb | Production | Cloudflare |
| Compute | GCE VM nexus-vm | Production, single VM | us-central1-a, IP 34.55.179.122 |
| Filesystem | /opt/nexus Python framework | Production | nexus-vm boot disk |
| HTTP service | Context Harness on localhost:8765 | Production | nexus-vm, systemd-managed |
| RAG store | Zeus, 19,679 chunks, text-embedding-3-small | Production | nexus-vm filesystem + index |
| Web UI | morpheus.callbrightside.com | Production | nexus-vm + Cloudflare |
| WP integration | claude-api → bricks.callbrightside.com WP REST | Production | Hostinger u227696829 |
| SSH | ~/.ssh/google_compute_engine + dovew user | Production | nexus-vm metadata |
| Secrets (today) | OS env vars / .env files on VM | To migrate | To Secret Manager |
| Backups (today) | None automated | To create | To GCS + snapshot policy |
15.2 SSH access patterns, where we are vs where we should be
Today. Robert SSHes via ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122. The public key is in instance metadata under ssh-keys. This works but has three weaknesses: (a) the VM has a public IP that any bot can scan, (b) revoking access requires editing metadata, (c) audit logs show dovew not robert.dove@callbrightside.com.
Bulletproof target. No public IP. Access via IAP TCP forwarding (Section 1.9.3) gated by OS Login (Section 1.9.1). Audit logs show the Google identity. Revoking access is one IAM binding removal.
📝 Code, the migration plan
# 1. Grant Robert OS Login + IAP roles
gcloud projects add-iam-policy-binding bsp-prod \
--member=user:robert.dove@callbrightside.com --role=roles/compute.osAdminLogin
gcloud projects add-iam-policy-binding bsp-prod \
--member=user:robert.dove@callbrightside.com --role=roles/iap.tunnelResourceAccessor
# 2. Add the IAP firewall rule
gcloud compute firewall-rules create allow-ssh-iap \
--network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
--rules=tcp:22 --source-ranges=35.235.240.0/20 --target-tags=ssh-iap
# 3. Test IAP works while public IP is still attached
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
# 4. Enable OS Login per-instance
gcloud compute instances add-metadata nexus-vm \
--zone=us-central1-a --metadata enable-oslogin=TRUE
# 5. After 7 days of stable IAP-only operation, drop the public IP
gcloud compute instances delete-access-config nexus-vm \
--zone=us-central1-a --access-config-name="External NAT"
⚠️ Gotcha, do not drop the public IP without first wiring up the LB
If the VM goes private, public traffic for morpheus.callbrightside.com cannot reach it directly. You need a global Application Load Balancer with the VM as a backend: the LB holds the public IP, and the VM accepts traffic only from the LB and IAP. Plan the LB before pulling the IP.
15.3 Service accounts for nexus-vm and external integration
Recommended SA design:
nexus-runner@bsp-prod.iam.gserviceaccount.com, attached to the VM. Roles:
- roles/secretmanager.secretAccessor on each app secret (scope to specific secrets, not project-wide)
- roles/storage.objectAdmin on the backup bucket only
- roles/logging.logWriter
- roles/monitoring.metricWriter
- roles/cloudtrace.agent
- roles/errorreporting.writer
cf-dns-bot@bsp-prod.iam.gserviceaccount.com, used by automation that touches Cloudflare. Permissions live in Cloudflare's API tokens; the GCP role only needs roles/secretmanager.secretAccessor on the Cloudflare token secret.
wp-integration@bsp-prod.iam.gserviceaccount.com, identity for code that calls the Hostinger WP REST API. Stores BRICKS_WP_APP_PASSWORD in Secret Manager.
📝 Code, attach a fresh SA to nexus-vm
# Create the SA
gcloud iam service-accounts create nexus-runner --display-name="Nexus VM Runner"
# Grant the secrets it needs
for secret in ANTHROPIC_API_KEY CLOUDFLARE_API_TOKEN BRICKS_WP_APP_PASSWORD VAPI_API_KEY OPENAI_API_KEY; do
gcloud secrets add-iam-policy-binding $secret \
--member=serviceAccount:nexus-runner@bsp-prod.iam.gserviceaccount.com \
--role=roles/secretmanager.secretAccessor
done
# Switch the VM (requires VM stop)
gcloud compute instances stop nexus-vm --zone=us-central1-a
gcloud compute instances set-service-account nexus-vm \
--zone=us-central1-a \
--service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
--scopes=cloud-platform
gcloud compute instances start nexus-vm --zone=us-central1-a
15.4 GCS as the backup destination for /opt/nexus
Pick or create a regional bucket in us-central1: gs://bsp-nexus-backups. Versioning on, lifecycle to Nearline at 30 days, Coldline at 90, delete at 365 (Section 4.2). UBLA on. Restrict IAM to the nexus-runner SA + a humans-only audit role.
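The lifecycle rules above (Nearline at 30 days, Coldline at 90, delete at 365) can be written down as the JSON that gcloud storage buckets update --lifecycle-file expects. A minimal sketch in Python; the bucket name is from this section, and the schema is the standard GCS lifecycle format:

```python
import json

# Lifecycle policy for gs://bsp-nexus-backups: tier down at 30 and 90 days,
# delete at 365. Save as lifecycle.json, then apply with:
#   gcloud storage buckets update gs://bsp-nexus-backups --lifecycle-file=lifecycle.json
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Versioning and UBLA still need their own flags (--versioning, --uniform-bucket-level-access); only the age tiers live in this file.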
📝 Code, daily /opt/nexus backup script
#!/bin/bash
# /opt/nexus/scripts/backup_daily.sh
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
ARCHIVE=/tmp/nexus-${DATE}.tar.zst
tar --zstd -cf $ARCHIVE \
--exclude='/opt/nexus/.git' \
--exclude='/opt/nexus/**/__pycache__' \
--exclude='/opt/nexus/**/*.pyc' \
/opt/nexus
gcloud storage cp $ARCHIVE gs://bsp-nexus-backups/daily/${DATE}.tar.zst
rm $ARCHIVE
# Optional: write a sentinel for the latest successful run
gcloud storage cp /dev/stdin gs://bsp-nexus-backups/_latest.txt <<< "$DATE"
# Cron entry
# 0 3 * * * /opt/nexus/scripts/backup_daily.sh >> /var/log/nexus-backup.log 2>&1
15.5 Cloud SQL evaluation for the WP database
The WordPress staging at bricks.callbrightside.com runs on Hostinger's MySQL. If we ever decide to bring WP on-platform (full GCP), the path is:
- Create a Cloud SQL MySQL 8.0 instance, 2 vCPU, 8 GB RAM, 100 GB SSD, HA enabled, automated backups, PITR enabled, maintenance window Saturday 03:00 UTC.
- Migrate via Database Migration Service (DMS). Set up continuous replication, validate, cutover.
- Update wp-config.php on a GCE-hosted PHP setup or App Engine to point at the Cloud SQL Auth Proxy socket.
- Use Secret Manager for the DB password.
- Take Cloud SQL backups daily, export weekly to GCS for cross-region DR.
Estimated incremental cost: roughly $210/month (~$190 for the HA Cloud SQL instance plus ~$17 storage and ~$5 backups, per Appendix D.3). Decision deferred until WP scale or compliance forces it.
15.6 GCE firewall rules and hardening
The minimum firewall set for nexus-vm in production:
| Name | Direction | Source | Ports | Targets |
|---|---|---|---|---|
| allow-ssh-iap | INGRESS | 35.235.240.0/20 | tcp:22 | tag ssh-iap |
| allow-https-cf | INGRESS | Cloudflare CIDR list | tcp:443 | tag web |
| allow-internal | INGRESS | 10.10.0.0/24 | all | VPC-internal |
| deny-all-ingress | INGRESS | 0.0.0.0/0 | all | (catch-all, priority 65534) |
Additional OS-level hardening: ufw or nftables mirroring the GCP firewall, fail2ban for SSH, automatic unattended upgrades enabled, root SSH disabled, password auth disabled, public key only.
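GCP evaluates ingress rules by priority: the matching rule with the lowest priority number wins, and unmatched traffic hits the implied deny. That is why the allows sit at the default priority 1000 under the 65534 catch-all. A toy evaluator to make the ordering concrete; the rule dicts are simplified stand-ins for the table above (not the API schema), and 104.16.0.0/13 is just one example Cloudflare range:

```python
import ipaddress

# Simplified mirror of the table above. port=None means all ports.
RULES = [
    {"name": "allow-ssh-iap",    "priority": 1000,  "action": "allow",
     "source": "35.235.240.0/20", "port": 22},
    {"name": "allow-https-cf",   "priority": 1000,  "action": "allow",
     "source": "104.16.0.0/13",   "port": 443},
    {"name": "allow-internal",   "priority": 1000,  "action": "allow",
     "source": "10.10.0.0/24",    "port": None},
    {"name": "deny-all-ingress", "priority": 65534, "action": "deny",
     "source": "0.0.0.0/0",       "port": None},
]

def evaluate(src_ip, port):
    """Lowest-numbered matching priority wins; no match falls to implied deny."""
    matches = [r for r in RULES
               if ipaddress.ip_address(src_ip) in ipaddress.ip_network(r["source"])
               and r["port"] in (None, port)]
    if not matches:
        return "deny (implied)"
    winner = min(matches, key=lambda r: r["priority"])
    return f'{winner["action"]} ({winner["name"]})'

print(evaluate("35.235.240.7", 22))   # → allow (allow-ssh-iap), beats deny-all at 65534
print(evaluate("203.0.113.9", 22))    # → deny (deny-all-ingress)
```

If an allow rule ever gets a priority above 65534 it silently stops working; keep allows numerically below the catch-all.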
📝 Code, baseline OS hardening on Debian/Ubuntu
sudo apt update && sudo apt install -y unattended-upgrades fail2ban
sudo dpkg-reconfigure -plow unattended-upgrades
# /etc/ssh/sshd_config tweaks
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
ClientAliveInterval 300
ClientAliveCountMax 2
sudo systemctl reload ssh
15.7 Static IP status and snapshots schedule
📝 Code, verify nexus-vm IP is static, then create a snapshot policy
# Check static IP
gcloud compute addresses list --filter="address=34.55.179.122"
# Promote ephemeral to static if needed
gcloud compute addresses create nexus-vm-static \
--addresses=34.55.179.122 --region=us-central1
# Create a daily snapshot schedule, retained 14 days (max-retention-days below;
# GCE schedules have no native weekly/monthly tiers, use manual snapshots for those)
gcloud compute resource-policies create snapshot-schedule nexus-daily \
--region=us-central1 \
--max-retention-days=14 \
--start-time=07:00 --daily-schedule \
--on-source-disk-delete=keep-auto-snapshots \
--storage-location=us
# Attach to the boot disk
gcloud compute disks add-resource-policies nexus-vm \
--zone=us-central1-a --resource-policies=nexus-daily
15.8 Load balancer + GCE backend for production scale
If we add a global Application Load Balancer for morpheus.callbrightside.com:
- Frontend: HTTPS, managed cert for morpheus.callbrightside.com, Cloud Armor policy attached.
- Backend: instance group of size 1 containing nexus-vm. Health check on :8765/healthz (the Context Harness exposes this).
- URL map: a single default backend service today, can route paths later.
- Cloudflare in front of the LB or removed; pick one CDN.
Benefit: TLS terminates on Google, can drop nexus-vm public IP, auto-scale to a MIG of 2 when needed without architecture rework.
15.9 Backup strategies, layered
| Layer | Frequency | RPO | RTO | Method |
|---|---|---|---|---|
| Boot disk snapshots | Daily | 24h | 15-30 min | Snapshot schedule (Section 15.7) |
| /opt/nexus tarball | Daily | 24h | 5-10 min | Cron + GCS (Section 15.4) |
| Git remote | On every push | Minutes | 1-2 min | GitHub origin |
| Secrets | On rotation | ~immediate | 1 min | Secret Manager versions |
| External APIs | N/A | N/A | N/A | Service-side responsibility (Hostinger, Cloudflare) |
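The RPO column above can be checked mechanically: worst-case data loss per layer is now minus that layer's last successful run. A minimal sketch with hypothetical last-success timestamps; the 15-minute git target is an assumption standing in for "Minutes" in the table:

```python
from datetime import datetime, timedelta

# Target RPO per backup layer, from the table above (git target assumed)
RPO = {
    "boot-disk-snapshot": timedelta(hours=24),
    "nexus-tarball":      timedelta(hours=24),
    "git-remote":         timedelta(minutes=15),
}

def rpo_breaches(last_success, now):
    """Return layers whose last success is older than their RPO target."""
    return sorted(layer for layer, target in RPO.items()
                  if now - last_success[layer] > target)

# Hypothetical last-success times
now = datetime(2025, 3, 2, 12, 0)
last = {
    "boot-disk-snapshot": datetime(2025, 3, 2, 7, 0),   # this morning, OK
    "nexus-tarball":      datetime(2025, 2, 28, 3, 0),  # two days old, breach
    "git-remote":         datetime(2025, 3, 2, 11, 55), # five minutes ago, OK
}
print(rpo_breaches(last, now))  # → ['nexus-tarball']
```

Feed real timestamps from snapshot listings, the GCS sentinel, and git log; a non-empty result is an alertable event.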
15.10 Monitoring nexus-vm via Ops Agent
See Section 6.7 for the Ops Agent install. The minimum alerts to wire up:
- Instance up = 0 for 3 minutes
- Disk usage > 85% for 5 minutes
- Memory usage > 90% for 5 minutes
- Context Harness :8765 uptime check fails for 2 minutes
- morpheus.callbrightside.com uptime check fails for 2 minutes
- Error rate from /opt/nexus logs > 5/min for 5 minutes
- Daily snapshot did not complete (custom log-based metric on snapshot job)
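The "X for N minutes" pattern above is a threshold plus a duration window: the condition fires only when every sample in the window breaches. A minimal sketch of that logic, assuming one sample per minute (Cloud Monitoring does this server-side; this is just the shape of the condition):

```python
def breaching_for(samples, threshold, minutes):
    """True if the last `minutes` samples (one per minute) all exceed threshold,
    mirroring a Cloud Monitoring threshold-plus-duration alert condition."""
    window = samples[-minutes:]
    return len(window) == minutes and all(v > threshold for v in window)

# Disk usage percent, one sample per minute; alert is >85% for 5 minutes
disk = [80, 82, 84, 86, 87, 88, 90, 91]
print(breaching_for(disk, 85, 5))  # → True, last five samples all above 85
print(breaching_for(disk, 85, 6))  # → False, the 84 falls inside a 6-minute window
```

The duration is what keeps a single noisy sample from paging anyone; tune it per alert, not globally.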
15.11 Secret Manager rotation for app secrets
Critical secrets to rotate on schedule:
- BRICKS_WP_APP_PASSWORD, application password in WordPress for the claude-api integration. Rotate every 90 days.
- CLOUDFLARE_API_TOKEN, scoped to zone a87220882ed631dd4dfb. Rotate every 90 days.
- ANTHROPIC_API_KEY, the Anthropic API key powering Daniel AI / Nexus calls. Rotate every 90 days.
- VAPI_API_KEY, Vapi (Daniel AI on (913) 963-9817, assistant e2920d04). Rotate every 90 days.
- OPENAI_API_KEY (text-embedding-3-small for Zeus). Rotate every 90 days.
Rotation pattern: add new version, update consumer code to read latest, monitor for ~24 hours, disable old version (do not destroy yet), verify, destroy after 7 days.
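The 90-day cadence is easy to let slip without a mechanical check. A minimal sketch of a due-for-rotation report; the last-rotation dates below are hypothetical, in practice derive them from version create times (gcloud secrets versions list NAME):

```python
from datetime import date, timedelta

ROTATE_EVERY = timedelta(days=90)

# Hypothetical last-rotation dates per secret
last_rotated = {
    "BRICKS_WP_APP_PASSWORD": date(2025, 1, 10),
    "CLOUDFLARE_API_TOKEN":   date(2024, 11, 1),
    "ANTHROPIC_API_KEY":      date(2025, 2, 20),
}

def due_for_rotation(last, today):
    """Secrets whose newest version is 90 or more days old."""
    return sorted(name for name, d in last.items()
                  if today - d >= ROTATE_EVERY)

print(due_for_rotation(last_rotated, date(2025, 3, 1)))  # → ['CLOUDFLARE_API_TOKEN']
```

Run it from cron on nexus-vm and log the result; an overdue secret is a checklist failure, not an emergency.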
15.12 Disaster recovery, nexus-vm dies, how to rebuild
Assume the VM is gone (deleted, host failure, regional outage). Recovery procedure, ordered:
- Confirm in the console that the instance is in fact gone, not just stopped (gcloud compute instances list --filter=name=nexus-vm).
- If the instance was deleted but the boot disk was retained (deleted with --keep-disks=boot; not the default), recreate the instance from the existing disk: gcloud compute instances create nexus-vm --zone=us-central1-a --disk=name=nexus-vm,boot=yes with the prior SA, tags, and network.
- If the boot disk is gone, restore from the latest snapshot: gcloud compute disks create nexus-vm --source-snapshot=nexus-daily-LATEST --zone=us-central1-a, then create an instance pointing at it.
- Reattach the static IP 34.55.179.122 via --address=nexus-vm-static.
- Validate that /opt/nexus is intact, run systemctl status nexus.service context-harness.service.
- If the zone is down, restore in a different zone of us-central1; if all of us-central1 is down, the snapshot is multi-regional so you can build in us-east1 (different external IP, update Cloudflare DNS).
- Smoke test: curl https://morpheus.callbrightside.com, run a Zeus search, check Context Harness /healthz.
- Rotate the Anthropic, Cloudflare, and BRICKS_WP_APP_PASSWORD secrets just in case the disaster was a credential compromise.
📝 Code, the fast rebuild script (run from any machine with gcloud)
#!/bin/bash
set -euo pipefail
PROJECT=bsp-prod
ZONE=us-central1-a
VM=nexus-vm
SA=nexus-runner@${PROJECT}.iam.gserviceaccount.com
# 1. Find latest snapshot
LATEST=$(gcloud compute snapshots list \
--filter="name~nexus-daily AND status=READY" \
--sort-by=~creationTimestamp --limit=1 --format="value(name)")
echo "Restoring from snapshot: $LATEST"
# 2. Recreate boot disk
gcloud compute disks create $VM \
--source-snapshot=$LATEST --zone=$ZONE --type=pd-balanced
# 3. Create instance from existing disk
gcloud compute instances create $VM \
--zone=$ZONE --machine-type=n2-standard-2 \
--disk=name=${VM},boot=yes,auto-delete=yes \
--service-account=$SA --scopes=cloud-platform \
--address=nexus-vm-static \
--tags=web,ssh-iap \
--shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring
# 4. Wait for boot, smoke test
sleep 60
gcloud compute ssh $VM --zone=$ZONE --tunnel-through-iap --command="systemctl status nexus.service"
15.13 Complete nexus-vm production architecture
✅ Production checklist, nexus-vm integration
- OS Login enabled, IAP TCP forwarding tested, plan to drop public IP
- Dedicated nexus-runner SA attached, default Compute SA detached
- Daily snapshot schedule on boot disk (14d retention)
- Daily /opt/nexus tarball to GCS regional bucket with lifecycle
- All app secrets in Secret Manager with 90d rotation cadence
- Ops Agent installed, /opt/nexus/logs ingested into Cloud Logging
- Uptime checks for morpheus.callbrightside.com and Context Harness :8765
- Static IP 34.55.179.122 promoted/named, not ephemeral
- Firewall rules: allow-ssh-iap, allow-https-cf, allow-internal, deny-all 65534
- OS hardening: unattended-upgrades, fail2ban, root login off, password auth off
- DR script tested quarterly, last test logged in MH
- Monthly cost review against the $72 baseline
🎓 FOR NEW HIRE, the nexus-vm onboarding lap
Day 1: SSH to nexus-vm, cd /opt/nexus, run ls, git status, systemctl status nexus.service context-harness.service. Day 2: open morpheus.callbrightside.com and click around, run a Zeus search via the harness. Day 3: read this section end-to-end. Day 5: shadow Robert through the daily ops loop. Week 2: own the daily backup verification (does the GCS bucket have today's tarball). Week 3: own a non-critical change, write a Master History entry. Month 2: lead a DR drill end-to-end with Robert observing.
Appendices
Appendix A. gcloud CLI cheatsheet for single-VM ops
| Action | Command |
|---|---|
| Configure account/project | gcloud auth login · gcloud config set project bsp-prod · gcloud config set compute/zone us-central1-a |
| List VMs | gcloud compute instances list |
| Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a |
| SSH | gcloud compute ssh nexus-vm --zone=us-central1-a [--tunnel-through-iap] |
| Stop / start | gcloud compute instances stop nexus-vm · gcloud compute instances start nexus-vm |
| Resize | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 (stopped VM) |
| Resize disk | gcloud compute disks resize nexus-vm --size=100GB then resize2fs |
| Snapshot | gcloud compute disks snapshot nexus-vm --snapshot-names=manual-$(date +%Y%m%d) |
| List snapshots | gcloud compute snapshots list --filter="name~nexus" |
| List firewall rules | gcloud compute firewall-rules list |
| Add firewall rule | gcloud compute firewall-rules create NAME --allow=tcp:443 --source-ranges=... |
| List addresses | gcloud compute addresses list |
| Reserve static IP | gcloud compute addresses create NAME --addresses=IP --region=us-central1 |
| Read serial port | gcloud compute instances get-serial-port-output nexus-vm |
| List service accounts | gcloud iam service-accounts list |
| Get IAM policy on project | gcloud projects get-iam-policy bsp-prod |
| Add IAM binding | gcloud projects add-iam-policy-binding bsp-prod --member=... --role=... |
| List secrets | gcloud secrets list |
| Read latest secret | gcloud secrets versions access latest --secret=NAME |
| Add secret version | echo -n "VAL" | gcloud secrets versions add NAME --data-file=- |
| Tail logs | gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc |
| Stream logs | gcloud alpha logging tail 'resource.type="gce_instance"' |
| List buckets | gcloud storage buckets list |
| Copy to GCS | gcloud storage cp file.tar.gz gs://bsp-nexus-backups/ |
| Download from GCS | gcloud storage cp gs://bsp-nexus-backups/latest.tar.gz . |
| List enabled APIs | gcloud services list --enabled |
| Run a connectivity test | gcloud network-management connectivity-tests create ... |
| Show quotas | gcloud compute regions describe us-central1 --format='value(quotas)' |
| Billing info | gcloud billing projects describe bsp-prod |
Appendix B. IAM roles → permissions matrix (single-VM relevant)
| Role | Key permissions | Use |
|---|---|---|
| roles/compute.osLogin | compute.instances.osLogin | SSH as a regular user via OS Login |
| roles/compute.osAdminLogin | compute.instances.osAdminLogin | SSH as sudo via OS Login |
| roles/iap.tunnelResourceAccessor | iap.tunnelInstances.accessViaIAP | SSH through IAP tunnel |
| roles/compute.instanceAdmin.v1 | compute.instances.* (start, stop, delete, set-machine-type) | Manage VM lifecycle |
| roles/compute.storageAdmin | compute.disks.*, compute.snapshots.* | Disks and snapshots |
| roles/compute.networkAdmin | compute.networks.*, compute.firewalls.*, compute.routers.* | VPC and firewalls |
| roles/storage.objectViewer | storage.objects.get, list | Read GCS objects |
| roles/storage.objectAdmin | storage.objects.* | Read/write GCS objects (bucket-scope) |
| roles/storage.admin | storage.* (incl. buckets) | Bucket admin, dangerous in prod |
| roles/secretmanager.secretAccessor | secretmanager.versions.access | Read latest/specific version |
| roles/secretmanager.secretVersionManager | secretmanager.versions.add, disable | Rotate secrets |
| roles/secretmanager.admin | secretmanager.* | Create/delete secrets |
| roles/cloudkms.cryptoKeyEncrypterDecrypter | cloudkms.cryptoKeyVersions.useToEncrypt/Decrypt | Use a key |
| roles/logging.logWriter | logging.logEntries.create | Write log entries |
| roles/logging.viewer | logging.logEntries.list | Read logs |
| roles/monitoring.metricWriter | monitoring.timeSeries.create | Write custom metrics |
| roles/monitoring.viewer | monitoring.* read | View dashboards |
| roles/monitoring.editor | monitoring.* write | Edit dashboards, alerts |
| roles/cloudtrace.agent | cloudtrace.traces.patch | Send traces |
| roles/errorreporting.writer | errorreporting.errorEvents.create | Send error events |
| roles/cloudsql.client | cloudsql.instances.connect | Connect through Cloud SQL Auth Proxy |
| roles/cloudbuild.builds.editor | cloudbuild.builds.* | Run Cloud Build |
| roles/iam.serviceAccountTokenCreator | iam.serviceAccounts.signBlob, getAccessToken | Sign on behalf of an SA |
| roles/iam.workloadIdentityUser | iam.serviceAccounts.getOpenIdToken | WIF target binding |
| roles/run.invoker | run.routes.invoke | Call a private Cloud Run service |
| roles/owner | everything | Avoid in production |
| roles/editor | almost everything except IAM | Avoid in production |
| roles/viewer | read most resources | OK for read-only humans |
Appendix C. Troubleshooting decision trees
C.1 VM unreachable via SSH
C.2 API returning 403
C.3 Costs spiking
C.4 Logs not appearing
Appendix D. Cost calculator examples (single-VM scenarios)
D.1 Baseline nexus-vm, n2-standard-2, 50 GB pd-balanced, 24/7
| Component | Quantity | Unit price | Monthly |
|---|---|---|---|
| n2 vCPU (us-central1) | 2 x 730 hr | $0.0317/hr | $46.30 |
| n2 RAM (us-central1) | 8 GiB x 730 hr | $0.00425/hr | $24.83 |
| SUD ~10% (auto) | applied | - | -$7.10 |
| pd-balanced | 50 GB | $0.10/GB-mo | $5.00 |
| Static IP (in-use) | 730 hr | free while in-use | $0.00 |
| Egress (light) | 10 GB | $0.085/GB | $0.85 |
| GCS backup (Standard) | 50 GB | $0.020/GB-mo | $1.00 |
| Snapshot storage | ~30 GB | $0.026/GB-mo | $0.78 |
| Logging | 5 GiB ingest | free under 50 GiB | $0.00 |
| Secret Manager | 10 active, 1k ops | ~$0.06/secret-mo | $0.60 |
| Subtotal | | | $72.26 |
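The subtotal can be sanity-checked by summing the rounded line items above; a quick arithmetic check in Python:

```python
# Monthly line items from table D.1, rounded as shown there
line_items = {
    "n2 vCPU":        46.30,
    "n2 RAM":         24.83,
    "SUD ~10%":       -7.10,
    "pd-balanced":     5.00,
    "static IP":       0.00,
    "egress":          0.85,
    "GCS backup":      1.00,
    "snapshots":       0.78,
    "logging":         0.00,
    "secret manager":  0.60,
}
total = round(sum(line_items.values()), 2)
print(total)  # → 72.26
```

Rerun this whenever a line item changes; the $72 baseline in the checklist tracks this number.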
D.2 Baseline + 1-yr CUD on N2 (2 vCPU + 8 GiB)
| Component | Effect | Delta |
|---|---|---|
| 1-yr CUD on n2 vCPU + RAM | ~37% off | -$22.70 |
| SUD does not stack | replace SUD | +$7.10 |
| Adjusted total | | ~$56.66/mo |
D.3 Baseline + Cloud SQL for WP (HA, db-custom-2-8192, 100 GB)
| Component | Monthly add |
|---|---|
| Cloud SQL HA, 2 vCPU + 8 GiB | ~$190 |
| Storage 100 GB SSD | ~$17 |
| Backups (auto) | ~$5 |
| Adjusted total | ~$284/mo |
D.4 Baseline + Global Application Load Balancer
| Component | Monthly add |
|---|---|
| Forwarding rule (1) | ~$18 (+ data processing) |
| Cloud Armor base + WAF rules | ~$5/policy + per-request |
| Egress through LB | $0.012/GB additional + standard egress |
| Adjusted total | ~$95-110/mo |
Appendix E. Glossary
- ADC
- Application Default Credentials, the discovery order Google client libraries follow.
- API
- Application Programming Interface, here the HTTP/gRPC service endpoint Google exposes.
- Artifact Registry
- Google's package and container image repository, successor to Container Registry.
- BQ
- BigQuery, Google's serverless data warehouse.
- CDN
- Content Delivery Network, here Cloud CDN or Cloudflare.
- CEL
- Common Expression Language, used in IAM conditions and org policy custom constraints.
- CIDR
- Classless Inter-Domain Routing, an IP range like 10.10.0.0/24.
- CMEK
- Customer-Managed Encryption Key, key in your Cloud KMS used to encrypt a resource.
- Cloud Armor
- Google's WAF + DDoS protection for the Application Load Balancer.
- Cloud Build
- Hosted CI service.
- Cloud Run
- Serverless container service, scales 0 to N.
- Cloud Shell
- Browser-based Linux shell pre-loaded with gcloud.
- CSEK
- Customer-Supplied Encryption Key, raw key bytes per request, mostly deprecated.
- CUD
- Committed Use Discount, 1- or 3-year commitment for compute pricing.
- DLP
- Data Loss Prevention, now Sensitive Data Protection.
- DR
- Disaster Recovery, the practice of rebuilding after major failure.
- Eventarc
- Event router that bridges audit logs and Pub/Sub into Cloud Run.
- GA
- Generally Available, the highest stability level for a Google product.
- GCE
- Google Compute Engine, the IaaS VM service.
- GCS
- Google Cloud Storage, the object store.
- GKE
- Google Kubernetes Engine, managed K8s.
- HA
- High Availability, here a regional Cloud SQL configuration with synchronous standby.
- HCL
- HashiCorp Configuration Language, the syntax of Terraform.
- HSM
- Hardware Security Module, dedicated cryptographic hardware.
- IAM
- Identity and Access Management.
- IAP
- Identity-Aware Proxy, fronts VMs and apps with Google identity auth.
- IaC
- Infrastructure as Code, e.g. Terraform.
- KMS
- Key Management Service.
- LB
- Load Balancer.
- MIG
- Managed Instance Group, an autoscaled cluster of identical VMs.
- MQL
- Monitoring Query Language, advanced query syntax for Cloud Monitoring.
- NCC
- Network Connectivity Center, hub-and-spoke management for VPC and hybrid.
- NIC
- Network Interface Controller. Also Network Intelligence Center.
- OS Login
- SSH access tied to Google identity, IAM-controlled.
- OWASP
- Open Worldwide Application Security Project, source of common rule sets.
- PD
- Persistent Disk, the older block storage family. Hyperdisk is the new family.
- PGA
- Private Google Access, lets a private VM reach googleapis.com via Google's backbone.
- PITR
- Point-In-Time Recovery, restore to any second within retention window.
- PSC
- Private Service Connect, attaches a managed service at a private IP inside your VPC.
- RAG
- Retrieval-Augmented Generation, here the Zeus index of 19,679 chunks.
- RPO
- Recovery Point Objective, the maximum data loss tolerated.
- RTO
- Recovery Time Objective, the maximum downtime tolerated.
- SA
- Service Account, a Google identity for software.
- SCC
- Security Command Center, GCP's posture and findings dashboard.
- SLI
- Service Level Indicator, the metric.
- SLO
- Service Level Objective, the target.
- SSO
- Single Sign-On.
- SSRF
- Server-Side Request Forgery, where a server is tricked into fetching attacker-chosen URLs.
- SUD
- Sustained Use Discount, automatic discount for monthly compute usage.
- TF
- Terraform.
- UBLA
- Uniform Bucket-Level Access, IAM-only access control on a GCS bucket.
- VPC
- Virtual Private Cloud, the global software-defined network.
- VPC-SC
- VPC Service Controls, a security perimeter around managed services.
- WAF
- Web Application Firewall.
- WIF
- Workload Identity Federation, keyless auth from outside-GCP workloads.
Appendix F. Quick reference card
Project: bsp-prod · Region: us-central1 · Zone: us-central1-a · VM: nexus-vm · IP: 34.55.179.122
SSH today: ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122 · SSH bulletproof: gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap
Daily ops loop: systemctl status nexus.service context-harness.service · journalctl -u nexus.service -n 100 · df -h · free -m
Backup verification: gcloud storage ls gs://bsp-nexus-backups/daily/ | tail -3 · gcloud compute snapshots list --filter="name~nexus-daily" --sort-by=~creationTimestamp --limit=3
Read a secret: gcloud secrets versions access latest --secret=NAME
Tail logs: gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc --format="value(timestamp,severity,jsonPayload.message)"
Cost dashboard: Console → Billing → Reports, group by SKU · Status: status.cloud.google.com
Incident first 5 minutes: (1) confirm symptom, (2) check status page, (3) gcloud compute instances describe nexus-vm, (4) Logs Explorer severity>=ERROR, (5) Connectivity Test from Cloudflare CIDR.
DR command: see Section 15.12 fast rebuild script.
Bulletproof rule: never the fast option, always best practice. Read first, build second. Receipts not narration.