GCP Architecture Reference, nexus-vm Production Stack

v1.0 · 2026-04-28 · Generated for nexus-vm production stack

GCP Architecture Reference

A field reference for operating the Bright Side Plumbing nexus-vm production stack on Google Cloud, plus an onboarding map for a new junior web developer joining the team. The doc treats Google Cloud as the operating environment, calls out every place the BSP stack actually touches GCP, and flags the gotchas that bite single-VM workloads. No em dashes anywhere, dark mode by default, and every section closes with a production checklist.

Table of contents

🎓 FOR NEW HIRE, How to read this doc

Welcome. The fastest path to productivity is: skim sections 1, 2, 3, 6, and 15. Sections 1 and 15 cover the actual VM you will SSH into. Section 2 explains how Google decides whether your account or service account is allowed to do something. Section 3 explains why traffic from a browser actually reaches the VM. Section 6 is how we know anything is broken. Section 15 ties it together for our specific stack. Everything else is reference, dive in when you need it. Cloud is mostly Python, with shell glue and the gcloud CLI; Go is the language Google itself uses to build the platform; TypeScript shows up at the edges (Cloudflare Workers, Next.js, Bricks builder). Lean Python first.


🏗️ 1. Compute Engine HIGH PRIORITY

Compute Engine (GCE) is Google Cloud's IaaS layer. Our entire Nexus operational stack runs on a single GCE VM named nexus-vm at external IP 34.55.179.122. Everything in this section is calibrated for single-VM operations. Multi-VM, MIG, and regional patterns are summarized so you can recognize them, not deeply rehearsed.

1.1 The mental model: VM lifecycle and where state lives

A GCE VM is the composition of three independent objects: an instance (CPU/RAM/network attachment), one or more persistent disks (block storage that survives the instance), and a project + zone binding that scopes everything else (firewall rules, IAM, billing). When you "stop" a VM you keep the disks, lose the running RAM, and stop paying for vCPU/RAM but keep paying for disks and reserved static IPs. When you "delete" the instance you can choose to keep or delete each attached disk. Snapshots are the durable backup unit, they live in GCS-backed regional or multi-regional storage and are independent of the disk.

States in Compute Engine: PROVISIONING → STAGING → RUNNING → STOPPING → TERMINATED. There is also SUSPENDING/SUSPENDED for the suspend-to-disk flow, which preserves RAM contents on a separate disk. Source: cloud.google.com/compute/docs/instances/instance-life-cycle.

⚠️ Gotcha, "stopped" still costs money

A stopped VM costs $0 for compute but you still pay for: attached persistent disks, attached GPUs that are reserved, reserved static external IPs (a static IP unattached to a running VM costs ~$0.005/hr, around $3.65/mo per address), and any committed-use discounts you bought. The "I'll stop the VM over the weekend to save money" play only works if you also detach unused IPs and sized disks correctly.
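Back-of-envelope math for the weekend-stop play, using the ~$0.005/hr static IP rate quoted above and an illustrative $0.10/GB-mo pd-balanced rate (both are assumptions, verify at cloud.google.com/compute/all-pricing):

```shell
# Monthly cost of a "stopped" VM's leftovers: reserved IP + disks.
# Rates are illustrative assumptions, not current list prices.
ip_rate_hr=0.005        # reserved static IP, unattached to a running VM
pd_rate_gb_mo=0.10      # pd-balanced per GB-month (illustrative)
disk_gb=100

stopped_cost() {
  awk -v ip="$ip_rate_hr" -v pd="$pd_rate_gb_mo" -v gb="$disk_gb" \
    'BEGIN { printf "%.2f\n", ip * 730 + pd * gb }'
}
stopped_cost   # prints 13.65: ~$3.65 for the idle IP, $10 for the disk
```

The IP alone is small, but it is pure waste on a stopped VM, which is why the checklist says release it.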

⚠️ Gotcha, regional vs zonal scoping

VMs and persistent disks are zonal. If us-central1-a goes down, your VM and its zonal PDs are inaccessible. Snapshots and images are multi-regional. Static external IPs are regional. Plan with this in mind: a snapshot can rebuild your VM in any zone, but a zonal disk cannot be cross-zone-attached, you must clone via snapshot. Source: cloud.google.com/compute/docs/regions-zones.

⚠️ Gotcha, instance metadata survives stop/start, not delete

Custom instance metadata (startup-script, ssh-keys, user data) lives on the instance object, not the disk. If you delete and recreate the instance reusing the same disk, the metadata is gone. Capture metadata before destructive operations: gcloud compute instances describe nexus-vm --zone us-central1-a --format='value(metadata)'.

1.2 Machine types: families, sizing, and what to pick

Machine types are grouped into families by workload pattern. The family determines the CPU platform, memory ratio, network bandwidth, and pricing curve.

| Family | Series | Workload fit | vCPU range | Mem/vCPU (GB) | Notes |
| --- | --- | --- | --- | --- | --- |
| General purpose | E2 | Cheap, web/dev, low constant load | 2-32 | 0.5-8 | Shared-core (e2-micro/small/medium), CPU platform abstracted |
| General purpose | N2, N2D | Balanced, most production workloads | 2-128 | 0.5-8 | N2 = Intel, N2D = AMD EPYC |
| General purpose | N4 | 2024 GA, Granite Rapids, Hyperdisk-only | 2-80 | 2-4 | Replaces N2 for new builds, but Hyperdisk only |
| General purpose | C3, C3D | Consistent high throughput | 4-176 | 2-8 | Sapphire/Genoa, Titanium NIC, Hyperdisk |
| General purpose | C4, C4A | 2024-25 GA, Emerald Rapids / Axion (Arm) | 2-192 | 2-8 | C4A is Google Axion Arm CPU |
| Compute optimized | C2, C2D, H3 | HPC, gaming servers, single-thread heavy | 4-360 | 2-8 | H3 is HPC-tuned, no live migration |
| Memory optimized | M1, M2, M3, X4 | SAP HANA, in-memory DBs | 40-1920 | 14-30 | X4 = bare metal up to 32 TB RAM |
| Storage optimized | Z3 | Local NVMe-heavy, OLAP, search | 88-176 | 8 | Up to 36 TB Local SSD |
| Accelerator optimized | A2, A3, G2 | GPU/ML, video transcoding | 12-208 | varies | A2 = A100, A3 = H100/H200, G2 = L4 |

For single-VM operations like nexus-vm, the practical universe is E2, N2/N2D, C3. E2 if you want maximum cost efficiency and your workload is bursty. N2 if you want predictable performance with broad disk type support. C3 if you need consistent high throughput, but be aware C3 forces you onto Hyperdisk Balanced, which has different pricing than PD-Balanced.

📝 Code, list machine types in our zone
gcloud compute machine-types list \
  --filter="zone:us-central1-a AND name~'^(e2|n2)-'" \
  --format="table(name,guestCpus,memoryMb,maximumPersistentDisksSizeGb)" \
  --sort-by="guestCpus,memoryMb"
⚠️ Gotcha, E2 shared-core is fine until it isn't

e2-micro / e2-small / e2-medium share a physical core with other tenants and use a CPU credit bucket. If your workload sustains above the burst baseline (25% for micro, 50% for small, 100% baseline only on medium when credits are full), it gets throttled. For nexus-vm running Python automation that occasionally spikes to do bulk embeddings, an E2 shared-core is the wrong move. Use e2-standard-2 at minimum, or N2 for predictable scheduling.
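A quick way to sanity-check a shape against the baselines above. The baseline percentages are the ones quoted in this gotcha; this is a rough sketch, not the actual credit-bucket algorithm:

```shell
# Map an E2 shared-core type to its burst baseline (from the gotcha above).
e2_baseline_pct() {
  case "$1" in
    e2-micro)  echo 25 ;;
    e2-small)  echo 50 ;;
    e2-medium) echo 100 ;;
    *)         echo 100 ;;   # standard shapes have no shared-core baseline
  esac
}

# Will sustained CPU at $2 percent drain credits and get throttled?
will_throttle() {  # $1 = machine type, $2 = sustained CPU %
  [ "$2" -gt "$(e2_baseline_pct "$1")" ] && echo yes || echo no
}

will_throttle e2-micro 40   # prints "yes": 40% sustained beats the 25% baseline
```

Pull your actual sustained utilization from Cloud Monitoring before trusting the verdict.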

1.3 Custom machine types and sustained/committed use discounts

N1, N2, N2D, and E2 families allow custom CPU/RAM ratios. You pay for vCPU and RAM independently. Useful when your workload wants 4 vCPU and 24 GB, which sits between n2-standard-4 (16 GB) and n2-highmem-4 (32 GB). The custom path lets you land on exactly the right shape without overprovisioning.
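A minimal sketch of the create flow: validate the memory-per-vCPU ratio against the 0.5-8 GB/vCPU window from the family table, then print (not run) the gcloud command for review. The instance name nexus-vm-2 and the 4 vCPU / 24 GB shape are illustrative:

```shell
# Desired custom shape (illustrative values).
vcpus=4
mem_gb=24

# N2 custom shapes must land between 0.5 and 8 GB per vCPU
# (per the family table above; verify for your target family).
ratio_ok() {
  awk -v c="$vcpus" -v m="$mem_gb" \
    'BEGIN { r = m / c; print (r >= 0.5 && r <= 8) ? "ok" : "bad" }'
}

# Print the create command instead of executing it, so it can be reviewed.
if [ "$(ratio_ok)" = "ok" ]; then
  echo "gcloud compute instances create nexus-vm-2 --zone=us-central1-a --custom-vm-type=n2 --custom-cpu=${vcpus} --custom-memory=${mem_gb}GB"
fi
```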

Two discount programs matter here: sustained use discounts, applied automatically with no opt-in as a VM runs for a larger share of the billing month, and committed use discounts (CUDs), which you purchase for a 1- or 3-year term.

💡 Insight, CUDs for a single VM are still worth it

Even for a single nexus-vm, a 1-year resource-based CUD on the exact vCPU/RAM count typically pays back inside ~7 months. The risk is being locked into a region and having to keep paying if you move clouds. Numbers in Section 12 (Cost).

1.4 Disk types: PD, Hyperdisk, Local SSD

| Disk | Backing | Max IOPS/disk | Max throughput | Capacity | Use |
| --- | --- | --- | --- | --- | --- |
| pd-standard | Spinning HDD | ~7,500 r / 15,000 w | ~1.2 GB/s | 10 GB-64 TB | Cheap archive volumes, batch jobs |
| pd-balanced | SSD, mid tier | 15,000-80,000 | 240-1,200 MB/s | 10 GB-64 TB | Default for most VMs, good cost/perf |
| pd-ssd | SSD, premium | 15,000-100,000 | 240-1,200 MB/s | 10 GB-64 TB | Latency-sensitive DB workloads |
| pd-extreme | SSD, provisioned IOPS | up to 120,000 | 2,200 MB/s | 500 GB-64 TB | Predictable extreme IOPS, high cost |
| hyperdisk-balanced | SSD, decoupled IOPS+capacity+throughput | up to 350,000 | 5,000 MB/s | 4 GB-64 TB | Required on N4/C3/C4, future default |
| hyperdisk-extreme | SSD | up to 500,000 | 10,000 MB/s | 64 GB-64 TB | SAP HANA, high-end DBs |
| hyperdisk-throughput | HDD-priced, throughput-tuned | low | up to 600 MB/s | 2 TB-32 TB | Big sequential reads, log archives |
| Local SSD | NVMe attached to host | up to 9M (aggregate) | tens of GB/s | 375 GB increments | Scratch/cache, ephemeral, lost on stop |
🔥 Recency, PD-Standard sunset path

Google has been steering customers off pd-standard. New machine families (N4, C4) do not allow it. For nexus-vm on N2 the option still exists, but pd-balanced is the default for new boot disks and it is rare for the cost difference to justify pd-standard. Source: cloud.google.com/compute/docs/disks.

⚠️ Gotcha, performance scales with disk size

For pd-balanced and pd-ssd, IOPS and throughput scale with capacity. A 100 GB pd-balanced disk caps at ~3,000 IOPS no matter what your VM is. If your DB feels slow, oversize the disk. The size-IOPS curve is documented at cloud.google.com/compute/docs/disks/performance.
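To size a disk for a target IOPS number, invert the curve. This sketch assumes the commonly cited pd-balanced shape of a ~3,000 IOPS baseline plus 6 IOPS per provisioned GB; verify the actual curve and the VM-level caps at the performance page before buying:

```shell
# Smallest pd-balanced size (GB) for a target IOPS, under the assumed
# curve: IOPS ~= 3000 baseline + 6 per GB, before VM-level caps.
pd_balanced_size_gb() {  # $1 = target IOPS
  awk -v t="$1" 'BEGIN {
    need = (t - 3000) / 6
    if (need < 10) need = 10                      # 10 GB minimum size
    r = (int(need) == need) ? need : int(need) + 1  # round up to whole GB
    print r
  }'
}

pd_balanced_size_gb 10000   # prints 1167
```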

⚠️ Gotcha, Local SSD is volatile

Local SSD is physically attached to the host machine. If the VM is stopped, terminated, live-migrated, or the host fails, the data is gone. Use Local SSD only for: ephemeral cache, RAID arrays where the application replicates data elsewhere, and scratch space for batch jobs. Never put a primary database on Local SSD without external replication.

1.5 OS images, Shielded VMs, Confidential VMs

Google maintains a catalog of public images: Debian (default for many tutorials), Ubuntu LTS (20.04, 22.04, 24.04), Rocky Linux 8/9, RHEL 7/8/9, CentOS Stream, SUSE, Windows Server (2016, 2019, 2022, 2025). Each project also gets a private image catalog for custom images you build with Packer or gcloud compute images create. The nexus-vm currently runs Debian (verify with cat /etc/os-release).

📝 Code, list latest Ubuntu 24.04 LTS images
gcloud compute images list \
  --project=ubuntu-os-cloud \
  --filter="family:ubuntu-2404-lts AND status=READY" \
  --sort-by=~creationTimestamp \
  --limit=3

Shielded VM adds three layers: secure boot (UEFI verifies signed firmware), virtual TPM (vTPM for measured boot, attestation), and integrity monitoring (each boot is checksummed and the dashboard shows drift). On by default for newer Google-published images. Costs nothing extra. Source: cloud.google.com/security/shielded-cloud/shielded-vm.

Confidential VM goes further: memory is encrypted in use using AMD SEV (N2D, C2D), AMD SEV-SNP (C3D), Intel TDX (C3), or NVIDIA H100 GPU memory protection (A3). Adds ~5-10% perf overhead on most workloads. Required for processing strongly regulated data on shared infrastructure. Source: cloud.google.com/confidential-computing.

⚠️ Gotcha, custom images and the kernel surprise

If you build a custom image from a Debian VM and apply it to a new VM, you may inherit a kernel pinned to the source machine type. When you create the new VM with a different machine family, the guest tools may fail to detect the new NIC or NVMe driver. The fix is to install google-osconfig-agent, google-cloud-sdk, and the google-compute-engine guest environment package before imaging. Or use gcloud compute images import which automates the conversion.

1.6 Metadata service: 169.254.169.254

Every GCE VM has a magic link-local IP 169.254.169.254 that serves project and instance metadata over HTTP. This is the same pattern as AWS EC2 IMDS but with key differences. Google's metadata service requires the header Metadata-Flavor: Google on every request, which prevents accidental exposure if a web server proxies untrusted user input.

📝 Code, metadata service queries from inside the VM
# Project ID
curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/project/project-id

# Default service account email
curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email

# Default SA OAuth access token (auto-refreshed by Google)
curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token

# SSH keys configured on the project (NOT used if OS Login is on)
curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/project/attributes/ssh-keys

# Instance ID, zone, machine type
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true&alt=json" | jq
⚠️ Gotcha, SSRF and the metadata service

If your application proxies arbitrary URLs and runs on a GCE VM, you have an SSRF vector to 169.254.169.254. The required Metadata-Flavor: Google header used to be enforced only on v1 endpoints; it is now mandatory for all responses. Still, lock down egress with a URL allowlist or block link-local IPs at the application layer.
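A minimal application-layer guard, as a sketch: refuse proxy targets whose host is the metadata service, by literal IP or by name. Real code should also resolve DNS and check the resolved address, since an attacker can point their own hostname at 169.254.169.254:

```shell
# Reject proxy targets aimed at the metadata service. String matching
# only; a production guard must also check the DNS-resolved IP.
is_metadata_target() {  # $1 = host portion of the requested URL
  case "$1" in
    169.254.*|metadata.google.internal|metadata) echo blocked ;;
    *) echo allowed ;;
  esac
}

is_metadata_target 169.254.169.254   # blocked
is_metadata_target example.com       # allowed
```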

1.7 Startup scripts and shutdown scripts

Two metadata keys give you boot- and shutdown-time hooks: startup-script (or startup-script-url pointing to GCS) and shutdown-script. Startup scripts run as root every time the VM boots. Shutdown scripts run on a graceful stop with a 90-second timeout, after which the VM is force-stopped. Output goes to the serial console (gcloud compute instances get-serial-port-output nexus-vm) and to journalctl -u google-startup-scripts.service.

📝 Code, set a startup script that installs Ops Agent
gcloud compute instances add-metadata nexus-vm \
  --zone=us-central1-a \
  --metadata=startup-script='#!/bin/bash
set -euo pipefail
if ! systemctl is-active --quiet google-cloud-ops-agent; then
  curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
  bash add-google-cloud-ops-agent-repo.sh --also-install
fi
systemctl enable google-cloud-ops-agent
systemctl start google-cloud-ops-agent'
⚠️ Gotcha, startup scripts run on every boot

Including when you stop and start the VM, including after a live migration in some cases. Make every startup script idempotent. The pattern systemctl is-active --quiet X || install_X is your friend. Do not put one-time bootstrap (project creation, DB init) in a startup script unless you guard it with a sentinel file.
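The sentinel-file guard looks like this. The SENTINEL path is an assumption; any root-writable location works:

```shell
# One-time bootstrap inside an every-boot startup script: guard the
# expensive step with a sentinel file so reboots become no-ops.
SENTINEL="${SENTINEL:-/var/lib/nexus/.bootstrap-done}"

bootstrap_once() {
  if [ -f "$SENTINEL" ]; then
    echo "bootstrap: already done, skipping"
    return 0
  fi
  # ... one-time work goes here (DB init, user creation, etc.) ...
  mkdir -p "$(dirname "$SENTINEL")" && touch "$SENTINEL"
  echo "bootstrap: completed"
}
```

First boot runs the work and drops the sentinel; every later boot hits the early return.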

1.8 Live migration, spot, preemptible

By default, Google migrates your VM live to another host for maintenance with no downtime (typically <1 second pause). The default availability policy onHostMaintenance=MIGRATE is correct for production. The other option is TERMINATE, which is used for GPU/TPU machines and Spot VMs.

Spot VMs (the modern name) and Preemptible VMs (the legacy name, capped at 24 hours) are deeply discounted (60-91% off) compute that Google can reclaim with 30 seconds notice. Use for: stateless batch, fault-tolerant queues, CI runners. Do not use for: a single VM hosting your only production stack. Source: cloud.google.com/compute/docs/instances/spot.

🔥 Recency, preemptible VMs are deprecated for new use

The 24-hour-capped legacy preemptible VMs are still functional but Google steers everyone to Spot VMs (no time cap, more flexible reclaim contract). New automation should set --provisioning-model=SPOT and not --preemptible. Source: cloud.google.com/compute/docs/instances/preemptible.

1.9 SSH access methods (HIGH PRIORITY)

This is the section you will reread most. Three independent ways to SSH to a GCE VM, with very different security postures.

1.9.1 Method A, OS Login

Recommended default. SSH keys are tied to your Google identity (the email you log into the Cloud Console with), enforced via the roles/compute.osLogin or roles/compute.osAdminLogin IAM roles. Keys are looked up at connection time by the OS Login components on the VM, so revocation is effectively instant: remove the IAM binding and the next SSH attempt fails.

📝 Code, enable OS Login at the project level
# Project-wide
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# Per-instance override
gcloud compute instances add-metadata nexus-vm \
  --zone=us-central1-a \
  --metadata enable-oslogin=TRUE

# Grant SSH access (regular user)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:robert.dove@callbrightside.com" \
  --role="roles/compute.osLogin"

# Grant sudo access
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:robert.dove@callbrightside.com" \
  --role="roles/compute.osAdminLogin"

1.9.2 Method B, Metadata SSH keys (legacy)

The traditional path. Each VM (or the project) carries an ssh-keys metadata entry containing public keys with a username:ssh-rsa AAAA... format. The Google Compute Engine guest agent picks these up and writes them into /home/<user>/.ssh/authorized_keys on each boot. This is what ~/.ssh/google_compute_engine is wired up to.

📝 Code, the BSP standard SSH path
# From Robert's local machine
ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122

# Or via gcloud (auto-handles keys, OS Login if enabled, IAP otherwise)
gcloud compute ssh nexus-vm --zone=us-central1-a

# With command, no shell
gcloud compute ssh nexus-vm --zone=us-central1-a --command="uptime"
⚠️ Gotcha, OS Login and metadata keys conflict

If enable-oslogin=TRUE is set, the metadata ssh-keys entry is ignored. You can be locked out if you flip OS Login on without granting yourself the OS Login IAM role. Always grant the role first, verify SSH works, then enable OS Login.

1.9.3 Method C, IAP TCP forwarding

Identity-Aware Proxy TCP forwarding tunnels SSH (and other TCP) through Google's IAP fabric to a VM that has no public IP. The connection authenticates as your Google identity, and the VM's firewall only needs to allow port 22 from the IAP range 35.235.240.0/20. This is the path to a fully private VM that no one on the internet can reach. Source: cloud.google.com/iap/docs/using-tcp-forwarding.

📝 Code, SSH via IAP (no public IP needed)
# gcloud handles the tunnel automatically
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap

# Required IAM
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:robert.dove@callbrightside.com" \
  --role="roles/iap.tunnelResourceAccessor"

# Required firewall rule
gcloud compute firewall-rules create allow-ssh-from-iap \
  --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=35.235.240.0/20
💡 Insight, the right answer for nexus-vm

Today, nexus-vm is reachable on a public IP 34.55.179.122 with metadata SSH keys. The bulletproof path is OS Login + IAP TCP forwarding + no public IP. The morpheus.callbrightside.com command center can still serve traffic via a load balancer with the VM as the backend. Section 15 walks through the migration.

1.10 Instance groups, autoscalers, and managed templates

For multi-VM patterns: a managed instance group (MIG) spawns identical VMs from an instance template, can autoscale on CPU, custom metric, or schedule, and integrates with backend services for load balancing. A regional MIG spreads VMs across all zones in a region for HA. We don't use MIGs today on nexus-vm but if BSP grows out of single-VM, the migration path is: snapshot the disk, build an instance template from the snapshot, define a MIG with size 1 first, then scale.

📝 Code, build an instance template from current nexus-vm
# Step 1: snapshot the boot disk
gcloud compute disks snapshot nexus-vm \
  --zone=us-central1-a \
  --snapshot-names=nexus-vm-template-$(date +%Y%m%d)

# Step 2: build a custom image
gcloud compute images create nexus-vm-image-v1 \
  --source-snapshot=nexus-vm-template-$(date +%Y%m%d) \
  --family=nexus-vm

# Step 3: create the template
gcloud compute instance-templates create nexus-vm-tpl-v1 \
  --machine-type=n2-standard-2 \
  --image-family=nexus-vm \
  --image-project=PROJECT_ID \
  --tags=http-server,https-server

1.11 gcloud compute reference (single-VM ops)

| Operation | Command |
| --- | --- |
| Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a |
| Stop / start | gcloud compute instances stop nexus-vm --zone=us-central1-a |
| Resize machine type | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 --zone=us-central1-a (VM must be stopped) |
| Resize boot disk | gcloud compute disks resize nexus-vm --size=100GB --zone=us-central1-a, then resize2fs in the guest |
| Add a new disk | gcloud compute disks create data-1 --size=200GB --type=pd-balanced --zone=us-central1-a, then gcloud compute instances attach-disk nexus-vm --disk=data-1 --zone=us-central1-a |
| Snapshot | gcloud compute disks snapshot nexus-vm --zone=us-central1-a --snapshot-names=nexus-vm-$(date +%Y%m%d) |
| Reset (hard) | gcloud compute instances reset nexus-vm --zone=us-central1-a (last resort) |
| Serial console | gcloud compute instances get-serial-port-output nexus-vm --zone=us-central1-a |
| Release static IP | gcloud compute addresses delete IP_NAME --region=us-central1 (deletes the reservation, not just the attachment) |

1.12 SVG: nexus-vm topology

[Diagram: Internet (browsers, scripts) → Cloudflare zone a87220882ed631dd4dfb → GCE VM nexus-vm (34.55.179.122) running the /opt/nexus Python framework, Context Harness :8765, Zeus RAG (19,679 chunks), morpheus.callbrightside.com. Alongside: Hostinger u227696829 WP. External services used: Anthropic API, Vapi (Daniel AI), WP REST (claude-api), Cloudflare API, ServiceTitan, BigSale, QB.]
Figure 1.1, nexus-vm topology and the boundary between BSP-owned compute (GCE) and external services.
✅ Production checklist, Compute Engine
🎓 FOR NEW HIRE, Compute Engine cheat lines

🔒 2. IAM, Service Accounts, Audit HIGH PRIORITY

IAM (Identity and Access Management) is how Google decides whether an identity (a user, a service account, or a Google group) is allowed to perform an action on a resource. Mastering IAM is the difference between secure, predictable infrastructure and a 3 a.m. incident root cause that reads "the default service account had Owner."

2.1 The hierarchy: organization, folder, project, resource

Resources sit in a four-level hierarchy. IAM bindings attached at any level inherit downward.

A binding is a 3-tuple (member, role, condition?). Members are the identity, roles are bundles of permissions, conditions are optional CEL expressions that gate the binding by request attributes. Source: cloud.google.com/iam/docs/overview.
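Concretely, one binding inside a project's IAM policy looks like this (the member and condition title are illustrative, the shape is what gcloud projects get-iam-policy returns):

```json
{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": ["user:robert.dove@callbrightside.com"],
      "condition": {
        "title": "audit-window-2026",
        "expression": "request.time < timestamp(\"2026-12-31T23:59:59Z\")"
      }
    }
  ]
}
```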

[Diagram: Organization callbrightside.com → Folders prod and sandbox → Projects bsp-prod, bsp-data, bsp-sbx → Resources (VM nexus-vm, GCS, SQL).]
Figure 2.1, IAM hierarchy. Bindings on the Org cascade to Folders, then Projects, then Resources.
⚠️ Gotcha, inheritance is additive only

IAM bindings add permissions as you go down the tree, never subtract. If you grant Owner at the Org level, you cannot revoke it at the project level. The only way to remove an inherited permission is to remove the higher binding or use a Deny policy (Org Policy + IAM Deny, see 2.10).

2.2 Service accounts, the identity for automation

A service account (SA) is a Google identity owned by a project, used by software (not humans) to authenticate. Email format: NAME@PROJECT_ID.iam.gserviceaccount.com. Every project gets several service accounts created automatically.

| Service account | Email pattern | Purpose |
| --- | --- | --- |
| Compute Engine default SA | PROJECT_NUMBER-compute@developer.gserviceaccount.com | Identity assumed by GCE VMs unless overridden |
| App Engine default SA | PROJECT_ID@appspot.gserviceaccount.com | App Engine and Cloud Functions Gen 1 default |
| Google APIs SA | PROJECT_NUMBER@cloudservices.gserviceaccount.com | Used by GCP services to act on your behalf, e.g. Deployment Manager |
| Cloud Build SA | PROJECT_NUMBER@cloudbuild.gserviceaccount.com | Cloud Build runs builds as this identity |
| Cloud Run SA | Compute default unless overridden | Service identity for Cloud Run revisions |
| Pub/Sub SA | service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com | Pub/Sub uses this for push delivery |
🔥 Recency, default SA permissions tightened

Before May 2024, the Compute Engine default SA was granted Editor (roles/editor) on the project at creation. Organizations created since then enforce the org policy constraint iam.automaticIamGrantsForDefaultServiceAccounts by default, so new projects no longer auto-grant Editor. Verify what your project actually has with gcloud projects get-iam-policy PROJECT_ID. Source: cloud.google.com/iam/docs/service-account-overview.

2.3 Service account keys, the danger zone

You can mint a downloadable JSON key for a service account. This key is a long-lived bearer credential. If it leaks, the holder authenticates as the SA from anywhere on the internet until you rotate the key. Rules of the road: avoid keys entirely when Workload Identity Federation or an attached service account will do, rotate on a calendar, store keys only in a secret manager, and never commit one to a repo.

📝 Code, list and rotate SA keys
# List keys for an SA
gcloud iam service-accounts keys list \
  --iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com

# Create a new key (90-day expiry)
gcloud iam service-accounts keys create new-key.json \
  --iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com

# Disable an old key (preferred over delete during rotation)
gcloud iam service-accounts keys disable KEY_ID \
  --iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com

# Delete after the rollout is verified
gcloud iam service-accounts keys delete KEY_ID \
  --iam-account=nexus-runner@PROJECT_ID.iam.gserviceaccount.com

2.4 Workload Identity Federation (the right answer)

WIF lets a workload outside GCP (GitHub Actions, AWS, Okta, anything that issues an OIDC or SAML token) impersonate a Google service account without ever touching a downloaded key. You configure a workload identity pool, define provider trust (issuer URL, audience, attribute mapping), and grant the external identity roles/iam.workloadIdentityUser on the target SA.

📝 Code, GitHub Actions to GCP without a key
# 1. Create the workload identity pool
gcloud iam workload-identity-pools create gh-pool \
  --location=global \
  --display-name="GitHub Actions"

# 2. Add the GitHub OIDC provider
gcloud iam workload-identity-pools providers create-oidc gh-provider \
  --workload-identity-pool=gh-pool --location=global \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository=='callbrightside/nexus'"

# 3. Grant the SA's WorkloadIdentityUser role to the GitHub repo subject
gcloud iam service-accounts add-iam-policy-binding \
  ci-deployer@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/attribute.repository/callbrightside/nexus"

2.5 Role taxonomy: basic, predefined, custom

| Tier | Examples | Use |
| --- | --- | --- |
| Basic (legacy) | roles/owner, roles/editor, roles/viewer | Avoid in production. Far too broad. |
| Predefined | roles/compute.instanceAdmin, roles/storage.objectViewer, roles/secretmanager.secretAccessor | Standard answer. Use these. |
| Custom | You define a list of permissions | For least-privilege when no predefined role fits |
📝 Code, build a custom role for the nexus runner
# nexus-runner.yaml
title: "Nexus Runner"
description: "Read GCS, write logs, no admin"
stage: GA
includedPermissions:
  - storage.objects.get
  - storage.objects.list
  - logging.logEntries.create
  - secretmanager.versions.access

gcloud iam roles create nexusRunner \
  --project=PROJECT_ID \
  --file=nexus-runner.yaml

2.6 Permissions you actually need on nexus-vm

| Action | Required permission(s) | Predefined role |
| --- | --- | --- |
| SSH to nexus-vm via gcloud | compute.instances.get + iap.tunnelInstances.accessViaIAP | roles/compute.osLogin + roles/iap.tunnelResourceAccessor |
| Stop/start nexus-vm | compute.instances.stop, compute.instances.start | roles/compute.instanceAdmin.v1 |
| Read a GCS bucket | storage.objects.get, storage.objects.list | roles/storage.objectViewer |
| Read a Secret Manager value | secretmanager.versions.access | roles/secretmanager.secretAccessor |
| Write log entries | logging.logEntries.create | roles/logging.logWriter |
| Write metrics | monitoring.timeSeries.create | roles/monitoring.metricWriter |
| Snapshot a disk | compute.disks.createSnapshot | roles/compute.storageAdmin |

2.7 Conditional bindings (CEL)

You can attach a Common Expression Language condition to any binding. The binding only fires when the request matches.

📝 Code, restrict GCS read access to a specific bucket and time window
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:auditor@callbrightside.com" \
  --role="roles/storage.objectViewer" \
  --condition='expression=resource.name.startsWith("projects/_/buckets/bsp-audit") && request.time < timestamp("2026-12-31T23:59:59Z"),title=audit-window-2026'

2.8 Domain-wide delegation

For Workspace customers, a service account can be granted the right to impersonate any user in the domain for specific OAuth scopes. Used by automation that needs to send mail as a user, read calendars across the org, etc. Configured in admin.google.com under Security → API Controls → Domain-wide Delegation. The SA's "client ID" (numeric, not the email) is whitelisted with a list of scopes. Powerful, dangerous, audit it.

2.9 Audit logs taxonomy

Cloud Audit Logs come in four streams: Admin Activity (always on, no charge), Data Access (off by default for most services, high volume when enabled), System Event (Google-initiated changes, always on), and Policy Denied (requests rejected by policy).

📝 Code, query audit logs for IAM changes
gcloud logging read \
  'logName=~"cloudaudit.googleapis.com%2Factivity" AND protoPayload.serviceName="iam.googleapis.com"' \
  --limit=20 --format="table(timestamp,protoPayload.authenticationInfo.principalEmail,protoPayload.methodName,protoPayload.resourceName)"

2.10 IAM Deny policies

2023+ feature. A Deny policy is the only way to prevent a permission regardless of inherited Allow bindings. Attaches at Org, Folder, or Project level. Useful for guardrails like "no one, not even Org Admins, can disable Audit Logging."

📝 Code, deny a specific permission for everyone except a break-glass group
# deny-policy.json
{
  "displayName": "deny-disable-audit-logging",
  "rules": [{
    "deniedPrincipals": ["principalSet://goog/public:all"],
    "exceptionPrincipals": ["principalSet://goog/group/break-glass@callbrightside.com"],
    "deniedPermissions": ["logging.googleapis.com/sinks.delete"]
  }]
}

gcloud iam policies create deny-disable-audit \
  --kind=denypolicies \
  --policy-file=deny-policy.json \
  --attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID

2.11 Troubleshooter, Policy Analyzer, Policy Simulator

✅ Production checklist, IAM
🎓 FOR NEW HIRE, IAM mental model

Every API call to GCP gets stamped "who is asking, what are they trying to do, on what resource." IAM is the lookup table that returns yes or no. The fastest debugging path when you get a 403: copy the exact permission string from the error, search the role catalog (cloud.google.com/iam/docs/understanding-roles), and grant the smallest predefined role that contains it. Never grant Owner to fix something. Once you do, you cannot tell what permission was actually missing.
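That debugging path can be scripted. The error text below is a typical shape for a gcloud 403, not a verbatim capture; the extraction pulls out the permission string to search the role catalog with:

```shell
# A representative gcloud 403 (illustrative text, typical shape).
err="ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - Required 'compute.instances.get' permission for 'projects/bsp-prod/zones/us-central1-a/instances/nexus-vm'"

# Pull the missing permission string out of the error message.
missing_permission() {
  printf '%s\n' "$err" | grep -o "Required '[a-z.]*'" | sed "s/Required '//; s/'//"
}
missing_permission   # prints compute.instances.get
```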


🌐 3. Networking, VPC, Load Balancing HIGH PRIORITY

Every byte that reaches nexus-vm traverses a chain of network primitives. Understanding the chain is what turns "the site is down" from a 30-minute incident into a 2-minute fix.

3.1 VPC architecture

A VPC (Virtual Private Cloud) is a global, software-defined network. Unlike AWS, where each VPC is regional, Google's VPC spans every region. A subnet, however, is regional. So a single VPC named default typically has one auto-mode subnet per region, each with a non-overlapping CIDR.

VPCs come in two modes: auto mode, which creates one subnet per region from a predefined 10.128.0.0/9 carve-up, and custom mode, where you create every subnet and range explicitly. Custom mode is the production answer.

📝 Code, create a custom-mode VPC and a subnet
gcloud compute networks create bsp-prod-vpc --subnet-mode=custom

gcloud compute networks subnets create nexus-subnet \
  --network=bsp-prod-vpc \
  --region=us-central1 \
  --range=10.10.0.0/24 \
  --enable-private-ip-google-access \
  --enable-flow-logs

3.2 Subnets, secondary ranges, alias IPs

A subnet has a primary IPv4 range used for VM NICs, plus optional secondary ranges used for GKE pod and service IPs (alias IP), or for assigning a /28 to a Cloud SQL Private Service Connection. Source: cloud.google.com/vpc/docs/subnets.

3.3 Firewall rules

VPC firewalls are stateful. A connection initiated from inside the VPC is allowed back in; you do not need a separate egress allow rule for return traffic. Rules are scoped to a network and evaluated in priority order: the lowest number wins, and at equal priority a deny rule beats an allow rule.

| Field | Meaning |
| --- | --- |
| Direction | INGRESS or EGRESS |
| Action | ALLOW or DENY |
| Priority | 0-65535, lower = higher priority. Default 1000. |
| Source ranges | List of CIDRs (ingress only) |
| Source tags / SAs | Restrict to VMs with a network tag or running as an SA |
| Target tags / SAs | Apply only to VMs with a tag or SA |
| Protocols/ports | e.g. tcp:80,443, udp:53, all |
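The evaluation order can be modeled in a few lines. This is a toy model of the tie-break rule, not the VPC implementation; input lines are hypothetical "priority action" pairs for rules that match a packet:

```shell
# Toy firewall verdict: among matching rules, lowest priority number
# wins; at equal priority, DENY beats ALLOW. Reads "priority action"
# lines on stdin.
decide() {
  sort -n | awk '
    NR == 1   { best = $1 }
    $1 == best { if ($2 == "DENY") deny = 1 }
    END       { print deny ? "DENY" : "ALLOW" }'
}

printf '1000 ALLOW\n1000 DENY\n65534 DENY\n' | decide   # prints DENY (tie, deny wins)
printf '900 ALLOW\n1000 DENY\n' | decide                # prints ALLOW (900 beats 1000)
```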
📝 Code, the BSP standard nexus-vm firewall posture
# SSH only from IAP
gcloud compute firewall-rules create allow-ssh-iap \
  --network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=35.235.240.0/20 \
  --target-tags=ssh-iap

# HTTPS from Cloudflare only
gcloud compute firewall-rules create allow-https-cf \
  --network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:443 --source-ranges=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22 \
  --target-tags=web

# Default deny all other ingress (priority 65534, hardened)
gcloud compute firewall-rules create deny-all-ingress \
  --network=bsp-prod-vpc --direction=INGRESS --action=DENY \
  --rules=all --source-ranges=0.0.0.0/0 --priority=65534
⚠️ Gotcha, default network is too permissive

Auto-created default network ships with default-allow-ssh, default-allow-rdp, and default-allow-icmp open to 0.0.0.0/0, plus default-allow-internal open to the entire internal range. Production should use a custom VPC with no default rules. If you must keep default, delete default-allow-ssh and default-allow-rdp immediately.
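If the default network has to stay, removing its two riskiest rules is a one-liner (they can always be re-created later):

```shell
# Delete the wide-open SSH and RDP rules on the default network.
# --quiet skips the interactive confirmation prompt.
gcloud compute firewall-rules delete default-allow-ssh default-allow-rdp --quiet
```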

⚠️ Gotcha, network tags vs target service accounts

Network tags are unauthenticated metadata, anyone with compute.instances.setTags can add the web tag to any VM and inherit its firewall rules. Target service accounts require iam.serviceAccounts.actAs and are the secure default for production. Migrate from tags to SAs in the firewall rules.

3.4 Cloud NAT, Private Google Access, Private Service Connect

3.5 Cloud Load Balancing variants

| Type | Layer | Scope | Use |
| --- | --- | --- | --- |
| Global external Application LB | L7 HTTPS | Global anycast IP | Public web apps, multi-region failover |
| Regional external Application LB | L7 HTTPS | Regional | Single-region web with custom auth |
| Internal Application LB | L7 HTTPS | Regional, VPC-internal | Microservices inside the VPC |
| Global external Network LB | L4 TCP/UDP/SSL | Global anycast IP | Non-HTTP global, e.g. game servers |
| Regional external Network LB | L4 TCP/UDP | Regional | Pass-through, preserves source IP |
| Internal Network LB | L4 TCP/UDP | Regional, VPC-internal | Internal pass-through |

3.6 Cloud Armor

WAF for the global Application Load Balancer. Features: pre-configured OWASP rules, rate-limiting (per-IP per-minute thresholds), bot management, geo-based allow/deny (GeoIP), Adaptive Protection (ML-based DDoS), reCAPTCHA Enterprise integration. Enabled per backend service. The Cloudflare in front of nexus-vm handles much of this today, but if we move to a GCP load balancer, Cloud Armor takes over.

3.7 Cloud CDN

Edge caching tied to the global Application LB. Set --enable-cdn on a backend service. Cache keys default to host + path, customizable. Negative caching for 404s. Cache invalidation via gcloud compute url-maps invalidate-cdn-cache. Today Cloudflare is our CDN, Cloud CDN is the migration target if we leave Cloudflare.
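An invalidation after a bad deploy looks like this; the url-map name bsp-url-map is a hypothetical placeholder, since we have no GCP load balancer today:

```shell
# Purge cached objects under a path prefix from Cloud CDN.
# bsp-url-map is an illustrative url-map name, not an existing resource.
gcloud compute url-maps invalidate-cdn-cache bsp-url-map --path="/assets/*"
```

Invalidations are rate-limited, so prefer versioned asset filenames over frequent purges.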

3.8 Cloud DNS

Authoritative managed DNS. Two zone types: public (resolvable on the internet) and private (resolvable only inside designated VPCs). DNSSEC available, DNS forwarding for hybrid (on-prem to GCP). Today callbrightside.com DNS is on Cloudflare; Cloud DNS is the option if we centralize on Google.
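If we ever move DNS off Cloudflare, the Cloud DNS setup would look roughly like this (zone name and record are illustrative, modeled on the existing callbrightside.com records):

```shell
# Create a public managed zone (trailing dot on dns-name is required)
gcloud dns managed-zones create bsp-zone \
  --dns-name="callbrightside.com." \
  --description="BSP public zone"

# Point morpheus at the nexus-vm static IP
gcloud dns record-sets create morpheus.callbrightside.com. \
  --zone=bsp-zone --type=A --ttl=300 --rrdatas=34.55.179.122
```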

3.9 VPC peering, Shared VPC, VPC-SC

3.10 Interconnect, VPN, Network Connectivity Center

Hybrid cloud options:

3.11 IAP for HTTPS

Beyond TCP forwarding (Section 1.9), IAP can sit in front of an HTTPS Load Balancer to add Google identity authentication on top of any backend (GCE, GKE, Cloud Run, App Engine). Set --enable-iap on the backend service; the LB then presents a Google sign-in flow before any request reaches the backend. The signed JWT is forwarded as X-Goog-IAP-JWT-Assertion for the backend to verify.
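Backend-side verification of that forwarded JWT can be sketched with the google-auth library. The audience string depends on your project number and backend service ID, so treat the function parameters as assumptions:

```python
# Sketch: verify the X-Goog-IAP-JWT-Assertion header behind an IAP-fronted LB.
# expected_audience looks like:
#   /projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID
from google.auth.transport import requests
from google.oauth2 import id_token

def verify_iap_jwt(iap_jwt: str, expected_audience: str) -> str:
    """Return the authenticated user's email if the IAP JWT is valid."""
    decoded = id_token.verify_token(
        iap_jwt,
        requests.Request(),
        audience=expected_audience,
        # IAP signs with its own key set, not the default Google certs
        certs_url="https://www.gstatic.com/iap/verify/public_key",
    )
    return decoded["email"]
```

Never trust the header without verifying it; anything inside the VPC could set it on a hand-crafted request.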

3.12 Static and ephemeral IPs, IP forwarding

📝 Code, verify nexus-vm has a static IP
gcloud compute addresses list --filter="address=34.55.179.122"
# If empty, the IP is ephemeral. Promote it:
gcloud compute addresses create nexus-vm-static \
  --addresses=34.55.179.122 \
  --region=us-central1

3.13 Network telemetry: VPC Flow Logs, Mirror, Intelligence Center

Figure 3.1, the path a request takes from a browser to the Python framework on nexus-vm: browser (TLS 1.3) to Cloudflare (DNS, CDN, WAF, zone a87220882ed631dd4dfb, orange-cloud proxy) to the GCP edge at 34.55.179.122 (pass-through network LB) to the VPC firewall (deny-all plus allow-https-cf and allow-ssh-iap) to nexus-vm on subnet 10.10.0.0/24 (internal IP 10.10.0.x, tags web and ssh-iap).
✅ Production checklist, Networking
🎓 FOR NEW HIRE, Networking field guide

"VPC" = software-defined network. "Subnet" = the per-region range of IPs. "Firewall rule" = which ports/sources can reach which VMs. "Load balancer" = the front door if we have more than one VM. Use gcloud compute networks list, gcloud compute networks subnets list, gcloud compute firewall-rules list to see the current state. When something is unreachable, the order of debugging is: (1) DNS, (2) Cloudflare, (3) firewall rule, (4) the VM's own iptables, (5) the application listening on the port. The Network Intelligence Center connectivity test does the first three for you.


💾 4. Storage, GCS, Cloud SQL HIGH PRIORITY

Storage on GCP comes in three flavors that matter for nexus-vm: object storage (GCS), block storage (PD/Hyperdisk attached to the VM), and managed databases (Cloud SQL). This section covers all three plus Filestore for shared files.

4.1 GCS storage classes

| Class | Min duration | Storage $/GB-mo | Retrieval $/GB | Use |
| --- | --- | --- | --- | --- |
| Standard | None | ~$0.020 | $0 | Hot, frequent access |
| Nearline | 30 days | ~$0.010 | $0.01 | Monthly backups |
| Coldline | 90 days | ~$0.004 | $0.02 | Quarterly backups |
| Archive | 365 days | ~$0.0012 | $0.05 | Compliance, long-term |

Min duration means you pay as if the object lived that long even if you delete earlier. All classes have the same single-digit-millisecond first-byte latency, the difference is purely cost-vs-retention. Multi-regional and dual-regional buckets cost slightly more for higher availability. Source: cloud.google.com/storage/pricing.

4.2 Lifecycle, versioning, retention, bucket lock

Lifecycle rules transition or delete objects based on age, version count, or class.

📝 Code, lifecycle for nexus-vm snapshots in GCS
# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30}},
      {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
       "condition": {"age": 90}},
      {"action": {"type": "Delete"},
       "condition": {"age": 365}}
    ]
  }
}

gcloud storage buckets update gs://bsp-backups \
  --lifecycle-file=lifecycle.json

Object versioning keeps prior versions when you overwrite or delete. A retention policy enforces a minimum age before deletion. Bucket Lock makes a retention policy permanent (it cannot be removed or shortened, only extended). Use Bucket Lock for compliance buckets only.
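Both settings are one-line bucket updates; the 30-day retention value below is illustrative, not a BSP policy:

```shell
# Keep prior object versions on overwrite/delete
gcloud storage buckets update gs://bsp-backups --versioning

# Enforce a minimum object age before deletion (illustrative value)
gcloud storage buckets update gs://bsp-backups --retention-period=30d
```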

4.3 IAM vs ACLs on buckets

Two access control systems coexist: IAM policies applied at the bucket level (the modern path) and legacy per-object ACLs. Uniform bucket-level access (UBLA) disables ACLs entirely so IAM is the single source of truth.

Set UBLA at bucket creation: gcloud storage buckets create gs://bsp-backups --uniform-bucket-level-access. The Object ACL system is essentially deprecated for new buckets. Source: cloud.google.com/storage/docs/uniform-bucket-level-access.

4.4 Signed URLs, signed policies, CORS

A signed URL grants time-bounded access to a single object using a service account's private key. CORS lets browser-based JS upload/download from a bucket.

📝 Code, signed URL in Python
from google.cloud import storage
from datetime import timedelta

client = storage.Client()
bucket = client.bucket("bsp-uploads")
blob = bucket.blob("reports/q2-2026.pdf")

url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),
    method="GET",
)
print(url)
⚠️ Gotcha, signed URLs require a key or signBlob permission

To sign with a service account on a GCE VM, the VM's SA needs iam.serviceAccounts.signBlob on itself. Otherwise you get an opaque error about no private key. Grant roles/iam.serviceAccountTokenCreator on the SA to the SA itself.

4.5 Cloud SQL configurations

| Engine | Versions | Notes |
| --- | --- | --- |
| MySQL | 5.7, 8.0, 8.4 | 5.7 EOL approaching |
| PostgreSQL | 11, 12, 13, 14, 15, 16, 17 | 14, 15, 16 supported. 11 EOL. |
| SQL Server | 2017, 2019, 2022 (Std, Enterprise, Web, Express) | License included or BYOL |

Tiers: shared-core (db-f1-micro, db-g1-small, retired in 2024+ for some versions), custom (1-96 vCPU, 0.9-624 GB RAM), high-memory presets. Storage: SSD or HDD, autogrow available.

4.6 HA, replicas, PITR, backups, maintenance

4.7 Cloud SQL Auth Proxy and IAM auth

The Auth Proxy is a small binary (Go) that establishes a TLS-encrypted tunnel from your application to Cloud SQL using your Google credentials, no password needed for the client side and no firewall rules to manage. IAM authentication lets a Google identity log into Postgres/MySQL with a short-lived token.

📝 Code, run the Cloud SQL Auth Proxy on nexus-vm
# Download (Linux amd64)
curl -o cloud-sql-proxy \
  https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.linux.amd64
chmod +x cloud-sql-proxy

# Connect to a Postgres instance via Unix socket
./cloud-sql-proxy PROJECT_ID:us-central1:bsp-pg --unix-socket=/cloudsql

# In your app, connect to host=/cloudsql/PROJECT_ID:us-central1:bsp-pg

4.8 PD vs Filestore vs GCS, when to use which

| Need | Pick | Why |
| --- | --- | --- |
| Boot disk for nexus-vm | pd-balanced | Default, sized to needed IOPS |
| Database files | pd-ssd or hyperdisk-balanced | Predictable latency |
| Shared files across multiple VMs | Filestore (NFS) | POSIX semantics |
| Daily backups, snapshots, large objects | GCS | Cheap, durable, lifecycle |
| Logs, ML training data, public assets | GCS | Throughput scales horizontally |
| Scratch / cache | Local SSD | Lowest latency, ephemeral |

4.9 Filestore tiers

| Tier | Min capacity | Throughput | Use |
| --- | --- | --- | --- |
| Basic HDD | 1 TB | 100 MB/s/TB | Sequential, low cost |
| Basic SSD | 2.5 TB | 1.2 GB/s | Mixed workloads |
| Zonal (Enterprise) | 1 TB | Scales linearly | SLA-backed, single zone |
| Regional | 1 TB | Scales linearly | HA across zones |
| Enterprise (legacy) | 1 TB | Scales linearly | Replaced by Regional |

4.10 Backup and DR strategies for nexus-vm

Figure 4.1, storage tiers plotted by cost ($/GB-mo) vs first-byte latency (ms, log scale): Local SSD, pd-ssd, pd-balanced, pd-standard, GCS Standard, Nearline, Coldline, Archive. Durability is 11 nines for all GCS classes and PD-replicated tiers.
✅ Production checklist, Storage
🎓 FOR NEW HIRE, Storage in one paragraph

GCS holds anything that is not actively being read by a database (assets, backups, logs, ML data). Disks live attached to a VM and hold the OS and database. Cloud SQL is a managed MySQL/Postgres/SQL Server. Pick GCS by default, pick a Disk only when you need POSIX file semantics on a single VM, pick Cloud SQL when you need ACID transactions and don't want to operate Postgres yourself. The Python SDK for GCS is google-cloud-storage, install with pip install google-cloud-storage and the docs are at cloud.google.com/python/docs/reference/storage.


🔒 5. Secret Manager, Cloud KMS MEDIUM PRIORITY

Secret Manager holds the application secrets that nexus-vm needs (Anthropic API key, Cloudflare API token, BRICKS_WP_APP_PASSWORD, Vapi API key). Cloud KMS holds the encryption keys that protect everything else. Different tools, different jobs, often confused.

5.1 Secret Manager: model and versioning

A secret is a named container. Each secret has multiple versions (1, 2, 3...), only one of which is the latest at any time. Versions are immutable. To rotate a secret, add a new version, point your app at latest or pin to a specific version. Each access is auditable.

📝 Code, the standard Secret Manager workflow
# Create a secret
gcloud secrets create BRICKS_WP_APP_PASSWORD \
  --replication-policy=automatic

# Add a version (rotation)
echo -n "new_app_password_value" | \
  gcloud secrets versions add BRICKS_WP_APP_PASSWORD --data-file=-

# Read the latest version (from inside nexus-vm)
gcloud secrets versions access latest --secret=BRICKS_WP_APP_PASSWORD

# Disable an old version (preferred during rollout)
gcloud secrets versions disable 3 --secret=BRICKS_WP_APP_PASSWORD

# Destroy after the rollout is verified (irreversible)
gcloud secrets versions destroy 3 --secret=BRICKS_WP_APP_PASSWORD

5.2 Replication and CMEK

5.3 Rotation and notifications

Secret Manager has built-in rotation scheduling. You set a --next-rotation-time and --rotation-period, and Secret Manager publishes a Pub/Sub message at the scheduled time. Your rotation handler (Cloud Function, Cloud Run job, etc.) creates a new version. Source: cloud.google.com/secret-manager/docs/rotation-recommendations.
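Attaching a schedule to an existing secret is a single update; the timestamp and 90-day period below are illustrative, and the Pub/Sub topic must already exist with the Secret Manager service agent granted publish rights (topics are typically attached with --topics at create time):

```shell
# Illustrative rotation schedule: notify every 90 days starting June 1
gcloud secrets update BRICKS_WP_APP_PASSWORD \
  --next-rotation-time="2026-06-01T00:00:00Z" \
  --rotation-period="7776000s"
```

Note that Secret Manager only publishes the reminder; your handler still has to create the new version.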

5.4 Cloud KMS: model

Hierarchy: key ringkeykey version. Key rings are regional (or global, or multi-regional). A key has a purpose (symmetric encryption, asymmetric signing, asymmetric decryption, MAC), an algorithm, and a protection level (software or HSM). Each version is the actual cryptographic material. KMS never returns the raw key, you call encrypt, decrypt, sign, or verify.
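The hierarchy maps directly onto gcloud commands. A round-trip with a symmetric key looks like this (key ring and key names reuse this section's examples; the plaintext file is a throwaway):

```shell
# Key ring (regional) and a symmetric encryption key inside it
gcloud kms keyrings create bsp-prod --location=us-central1
gcloud kms keys create app-data-key \
  --keyring=bsp-prod --location=us-central1 --purpose=encryption

# Encrypt and decrypt without ever seeing the raw key material
echo -n "hello" > /tmp/plain.txt
gcloud kms encrypt --keyring=bsp-prod --location=us-central1 \
  --key=app-data-key \
  --plaintext-file=/tmp/plain.txt --ciphertext-file=/tmp/cipher.bin
gcloud kms decrypt --keyring=bsp-prod --location=us-central1 \
  --key=app-data-key \
  --ciphertext-file=/tmp/cipher.bin --plaintext-file=/tmp/roundtrip.txt
```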

5.5 CMEK on Compute, GCS, Cloud SQL

Customer-Managed Encryption Keys override the default Google-managed encryption with a key in your KMS. Apply at resource create time:

📝 Code, create a GCS bucket with CMEK
gcloud storage buckets create gs://bsp-cmek-test \
  --default-encryption-key=projects/PROJECT_ID/locations/us-central1/keyRings/bsp-prod/cryptoKeys/gcs-cmek-1

# Grant GCS service agent permission to use the key
gcloud kms keys add-iam-policy-binding gcs-cmek-1 \
  --keyring=bsp-prod --location=us-central1 \
  --member=serviceAccount:service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
  --role=roles/cloudkms.cryptoKeyEncrypterDecrypter

5.6 CSEK (legacy) and external HSM

Customer-Supplied Encryption Keys let you provide raw bytes per request. Mostly deprecated in favor of CMEK. External HSM via Cloud HSM (managed) or Cloud External Key Manager (your HSM at a partner like Equinix). Skip unless mandated by compliance.

5.7 Secret Manager into nexus-vm

📝 Code, fetch a secret from Python on nexus-vm
from google.cloud import secretmanager

def get_secret(name: str) -> str:
    client = secretmanager.SecretManagerServiceClient()
    project = "bsp-prod"
    path = f"projects/{project}/secrets/{name}/versions/latest"
    response = client.access_secret_version(request={"name": path})
    return response.payload.data.decode("utf-8")

bricks_pwd = get_secret("BRICKS_WP_APP_PASSWORD")
cf_token = get_secret("CLOUDFLARE_API_TOKEN")
⚠️ Gotcha, never log secret values

Set up a logging filter that drops any field containing BRICKS_WP_APP_PASSWORD, CLOUDFLARE_API_TOKEN, ANTHROPIC_API_KEY. The fastest way to leak a secret is to print(get_secret(...)) during a debug session and forget. Secret Manager itself only audits "this secret was accessed", which still leaves you to track down where it went.

✅ Production checklist, Secrets and KMS
🎓 FOR NEW HIRE, Secret Manager rule

If a value would let someone act as you, it goes in Secret Manager. Period. No .env files committed to git, no values pasted in Slack, no config.py with constants. The nexus-vm service account gets read access at runtime and Secret Manager logs every access for the audit trail.


📊 6. Observability, Logging, Monitoring HIGH PRIORITY

If nexus-vm goes sideways, observability is how you find out before Robert does. Cloud Logging, Cloud Monitoring, Cloud Trace, Profiler, and Error Reporting are five products under the umbrella name "Cloud Operations Suite" (formerly Stackdriver).

6.1 Cloud Logging architecture

Every log entry is a structured JSON document with a timestamp, a log name, a severity, a resource label set, and a payload. Entries flow into buckets (storage with retention), filtered by sinks (which entries go to which bucket or destination). The _Default bucket holds 30 days, the _Required bucket holds Admin Activity audit logs for 400 days, both free.

| Concept | What it is |
| --- | --- |
| Log entry | One row, structured fields, payload |
| Log name | Logical stream, e.g. cloudaudit.googleapis.com%2Factivity |
| Bucket | Storage location with retention policy |
| Sink | Filter expression + destination (bucket, BigQuery, GCS, Pub/Sub) |
| View | Restricts who sees which entries inside a bucket |
| Scope | Cross-project log scope for a single Logs Explorer |

6.2 Logs Explorer query language

📝 Code, common Logs Explorer queries
# All warnings/errors from nexus-vm in the last hour
resource.type="gce_instance"
resource.labels.instance_id="123456789"
severity>=WARNING
timestamp>="2026-04-28T18:00:00Z"

# IAM policy changes, last 7 days
logName=~"cloudaudit.googleapis.com%2Factivity"
protoPayload.serviceName="iam.googleapis.com"
protoPayload.methodName=~"SetIamPolicy"

# Failed SSH attempts
resource.type="gce_instance"
jsonPayload.message=~"Failed password for"

6.3 Retention, log-based metrics, alerts

Log-based metrics convert a log filter into a counter or distribution. Useful for "alert me when error rate > 1/min."

📝 Code, create a log-based metric for nexus-vm 5xx
gcloud logging metrics create nexus_5xx \
  --description="5xx responses on nexus-vm" \
  --log-filter='resource.type="gce_instance" AND jsonPayload.status>=500'

6.4 Cloud Monitoring

Workspaces, dashboards, MQL (Monitoring Query Language) for advanced queries, alerts with conditions and notification channels (email, Slack, PagerDuty, webhook). Default integrations exist for every GCE metric (CPU, disk, network, instance/up).
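Reading those metrics programmatically goes through the monitoring_v3 client. A sketch that pulls the last 10 minutes of CPU utilization for the project (assumes ADC on the VM and the google-cloud-monitoring package):

```python
# Sketch: list recent CPU utilization time series for all GCE instances.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)},
     "start_time": {"seconds": int(now - 600)}}  # last 10 minutes
)

results = client.list_time_series(
    request={
        "name": "projects/bsp-prod",
        "filter": 'metric.type="compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    instance = series.resource.labels.get("instance_id")
    latest = series.points[0].value.double_value  # points are newest-first
    print(instance, f"{latest:.1%}")
```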

6.5 Uptime checks and SLOs

Uptime checks ping a public URL from 6 global locations every minute. Failures trigger alerts. Configure in Monitoring → Uptime checks. Add an HTTPS check for https://morpheus.callbrightside.com.

📝 Code, define an uptime check
gcloud monitoring uptime create morpheus-https \
  --resource-type=uptime-url \
  --resource-labels="host=morpheus.callbrightside.com,project_id=PROJECT_ID" \
  --protocol=https --request-method=GET --path=/ --port=443 \
  --period=1 --timeout=10
# --period is in minutes (1, 5, 10, or 15), --timeout in seconds

6.6 Cloud Trace, Profiler, Error Reporting

6.7 Ops Agent vs deprecated agents

The Ops Agent (single binary, Linux + Windows) replaces the legacy Stackdriver Logging Agent and Monitoring Agent. Configures via /etc/google-cloud-ops-agent/config.yaml.

📝 Code, ship Python framework logs from /opt/nexus to Cloud Logging
# /etc/google-cloud-ops-agent/config.yaml
logging:
  receivers:
    nexus_app:
      type: files
      include_paths:
        - /opt/nexus/logs/*.log
        - /opt/nexus/nexus/scripts/output/*.log
      record_log_file_path: true
  processors:
    json_parse:
      type: parse_json
  service:
    pipelines:
      nexus_pipeline:
        receivers: [nexus_app]
        processors: [json_parse]
metrics:
  receivers:
    hostmetrics:
      type: hostmetrics
      collection_interval: 60s
  service:
    pipelines:
      default_pipeline:
        receivers: [hostmetrics]

# Apply
sudo systemctl restart google-cloud-ops-agent
🔥 Recency, legacy agents EOL

The legacy Logging Agent and Monitoring Agent were deprecated in 2023 and reach end of support October 2024 / Q1 2025. New installs must use Ops Agent. If you find google-fluentd or stackdriver-agent on the VM, that is an upgrade you owe yourself. Source: cloud.google.com/stackdriver/docs/deprecations.

6.8 Audit logs (cross-reference)

See Section 2.9. Admin Activity is always on, free, 400-day retention.

6.9 Common alerting patterns

6.10 Observability cost

Logging: $0.50/GiB ingest, free egress to bucket, then storage $0.01/GiB-mo after 30 days for default bucket. Monitoring: free for resource metrics, $0.2580/MiB for chargeable metrics. Trace: $0.20 per million spans. Profiler: free.

6.11 Shipping /opt/nexus logs to Cloud Logging

Section 6.7's config is the right answer. Two patterns to know: tail log files with the Ops Agent (the Section 6.7 config), or emit directly from Python with the google-cloud-logging handler.

📝 Code, Python direct logging into Cloud Logging
import logging
from google.cloud import logging as cloud_logging

cloud_logging.Client().setup_logging(log_level=logging.INFO)

logger = logging.getLogger("nexus.runner")
logger.info("job_started", extra={"json_fields": {"job_id": "abc"}})
Figure 6.1, observability pipeline from the nexus-vm Python framework: /opt/nexus logs and /var/log + journald flow through the Ops Agent (file receivers + hostmetrics) into Cloud Logging, Cloud Monitoring, and Error Reporting, then out via sinks (BigQuery, GCS, Pub/Sub) and alerts (email, Slack, webhook).
✅ Production checklist, Observability
🎓 FOR NEW HIRE, Observability mental model

Logs answer "what happened." Metrics answer "is it normal." Traces answer "where did the request spend time." Errors answer "what's broken." Start with logs and metrics; learn traces when you need them. The first place to look during an incident is Logs Explorer, filter to severity>=ERROR and the time window you care about. The Logs Explorer query language is documented at cloud.google.com/logging/docs/view/logging-query-language.


📝 7. APIs, Auth, SDKs MEDIUM PRIORITY

Every Google Cloud product is an HTTP API. Every API call passes through three checkpoints: API enablement (is the service turned on for this project), authentication (who is calling), and authorization (Section 2 IAM). Understanding the layers makes 401/403 errors trivial.

7.1 APIs catalog and enablement

Each API is identified by a service name like compute.googleapis.com, storage.googleapis.com, secretmanager.googleapis.com, aiplatform.googleapis.com. APIs must be explicitly enabled per project before any call works.

📝 Code, manage API enablement
# List enabled APIs
gcloud services list --enabled

# Enable an API
gcloud services enable secretmanager.googleapis.com

# Disable (will fail if resources exist)
gcloud services disable bigtable.googleapis.com

7.2 OAuth 2.0 flows

| Flow | Use |
| --- | --- |
| Authorization Code | Web apps acting on behalf of a user |
| Authorization Code + PKCE | Native and mobile apps |
| Client Credentials (JWT bearer) | Service accounts |
| Implicit (legacy) | Avoid |
| Device flow | TVs, CLI on a remote box |

7.3 Application Default Credentials (ADC)

The discovery order Google client libraries follow when looking for credentials:

  1. GOOGLE_APPLICATION_CREDENTIALS env var pointing at a JSON key file
  2. gcloud user credentials in ~/.config/gcloud/application_default_credentials.json
  3. Attached service account on a GCE VM, Cloud Run, Cloud Functions, App Engine, GKE Workload Identity
  4. External account (WIF) configured via gcloud iam workload-identity-pools create-cred-config

Use gcloud auth application-default login for local development. On nexus-vm, no env var is needed; the attached SA is auto-discovered via the metadata service.
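You can see which credential source ADC resolved to without making any API call. A minimal sketch using the google-auth library:

```python
# Sketch: inspect what Application Default Credentials resolved to.
# On nexus-vm this prints a Compute Engine credential type; on a laptop
# after `gcloud auth application-default login`, a user credential type.
import google.auth

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
print(type(credentials).__name__, project_id)
```

This is the fastest way to debug "which identity is my script actually using" before chasing IAM errors.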

7.4 gcloud config and named profiles

📝 Code, multi-account gcloud profiles
gcloud auth login                                  # browser flow
gcloud config configurations create bsp-prod
gcloud config set project bsp-prod
gcloud config set compute/zone us-central1-a
gcloud config set account robert.dove@callbrightside.com

gcloud config configurations list
gcloud config configurations activate bsp-prod

7.5 Cloud Shell vs local development

Cloud Shell is a free Linux VM in your browser pre-loaded with gcloud, kubectl, terraform, docker, python, node. Persists 5 GB of $HOME. Sessions auto-expire after 60 minutes idle. Useful when your local machine doesn't have gcloud, or when you want to test as a different identity without polluting local config.

7.6 Python SDK clients

| Library | Install | For |
| --- | --- | --- |
| google-cloud-storage | pip install google-cloud-storage | GCS |
| google-cloud-secret-manager | pip install google-cloud-secret-manager | Secret Manager |
| google-cloud-compute | pip install google-cloud-compute | Compute Engine API |
| google-cloud-logging | pip install google-cloud-logging | Cloud Logging |
| google-cloud-monitoring | pip install google-cloud-monitoring | Cloud Monitoring |
| google-cloud-pubsub | pip install google-cloud-pubsub | Pub/Sub |
| google-cloud-aiplatform | pip install google-cloud-aiplatform | Vertex AI |
| google-cloud-bigquery | pip install google-cloud-bigquery | BigQuery |

7.7 REST vs gRPC

Google client libraries default to gRPC where supported (faster, streaming, smaller wire). REST is the fallback (firewall friendlier, easier to debug with curl). Most products support both; the Python SDK abstracts the choice. Compute Engine, Cloud SQL Admin, and a handful of older APIs are REST-only.

7.8 API versioning

Versions follow v1, v1beta1, v1alpha1. Beta is supported for production but breaking changes may occur. Alpha is allowlist-only. Pin to a specific version in your client library imports to avoid surprises.

7.9 Quotas and rate limiting

Every API has per-minute and per-day quotas. View at IAM & Admin → Quotas & System Limits. Common surprises: Compute Engine API Persistent disks (GB) regional quota, Cloud Logging Write API requests per minute, Cloud Functions Concurrent function executions. Request increases via the console; turnaround is hours-to-days.

7.10 Backoff and retry patterns

📝 Code, exponential backoff with the Google client library
from google.api_core import retry
from google.cloud import storage

custom_retry = retry.Retry(initial=1.0, multiplier=2.0, maximum=30.0, deadline=300.0)

client = storage.Client()
bucket = client.bucket("bsp-backups")
blob = bucket.blob("daily.tar.gz")
blob.upload_from_filename("daily.tar.gz", retry=custom_retry)
⚠️ Gotcha, idempotency and retries

Auto-retry only works safely on idempotent operations (GET, PUT with full payload, DELETE). For non-idempotent POSTs (create instance), wrap in requestId to dedupe. Compute Engine accepts a requestId on most insert operations.

✅ Production checklist, APIs
🎓 FOR NEW HIRE, the Python + GCP starter kit

You will live mostly in Python. The "official" Google client libraries follow a consistent shape: Client object → resource methods. Read cloud.google.com/python/docs/reference as your bookmark page. Bash and gcloud are the second language for ops scripting. Go and TypeScript surface in two contexts only: Cloud Functions / Cloud Run if we go serverless (TS or Python or Go), and the Cloud SQL Auth Proxy / Ops Agent (Go internals). You can ship effective work for years on Python + Bash + a little gcloud.


🚀 8. Build, Deploy, IaC MEDIUM PRIORITY

How code gets from a git push to running on production. The BSP nexus-vm stack today is updated by SSH + git pull + systemd restart. The mature future is Cloud Build → Artifact Registry → deploy.

8.1 Cloud Build

Hosted CI. Each build runs in a sandbox using a sequence of Docker steps defined in cloudbuild.yaml. Triggers on git push (GitHub, GitLab, Bitbucket, Cloud Source Repositories) or webhook. Outputs land in Artifact Registry, GCS, or anywhere the build calls.

📝 Code, minimal cloudbuild.yaml for nexus-vm Python sync
# cloudbuild.yaml
steps:
  - name: "python:3.11-slim"
    entrypoint: bash
    args:
      - -c
      - "pip install -r requirements.txt && python -m pytest -q"
  - name: "gcr.io/cloud-builders/gcloud"
    args: ["compute", "ssh", "nexus-vm", "--zone=us-central1-a",
           "--command=cd /opt/nexus && git pull && sudo systemctl restart nexus.service"]
options:
  logging: CLOUD_LOGGING_ONLY
timeout: "600s"

8.2 Artifact Registry

The successor to Container Registry (gcr.io). Holds container images, plus Maven, npm, Python (PyPI-style), Apt, Yum, Go module, generic file repos. Regional or multi-regional. Per-repo IAM. Vulnerability scanning available (Container Analysis API). Source: cloud.google.com/artifact-registry/docs.

🔥 Recency, Container Registry sunset

Container Registry (gcr.io) was deprecated in 2023 and is being shut down. New images go to Artifact Registry. Existing gcr.io/PROJECT/image URLs auto-redirect via the pkg.dev Artifact Registry mirror, but you should migrate explicitly. Run the migration tool: gcloud artifacts docker upgrade migrate.

8.3 Cloud Deploy

Managed delivery pipeline (continuous delivery). Stages a release through a chain of environments (dev → staging → prod) with manual or automatic promotion gates. Native targets are GKE, Cloud Run, and recently GCE MIG. For a single VM, the value is lower; we lean on Cloud Build directly today.

8.4 IaC choices: Terraform, Deployment Manager, gcloud, Config Connector

| Tool | Status | Use |
| --- | --- | --- |
| Terraform | Industry standard, recommended | Multi-cloud, broad community |
| Deployment Manager | Deprecated 2024, EOL | Legacy projects only, migrate |
| gcloud / scripts | Active | One-offs, ad hoc |
| Config Connector | Active | Manage GCP from inside K8s, GitOps |
| Pulumi | Active third-party | If you prefer real code over HCL |
📝 Code, minimal Terraform for nexus-vm
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "bsp-tfstate"
    prefix = "nexus/prod"
  }
}

provider "google" {
  project = "bsp-prod"
  region  = "us-central1"
}

resource "google_compute_instance" "nexus_vm" {
  name         = "nexus-vm"
  machine_type = "n2-standard-2"
  zone         = "us-central1-a"
  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 50
    }
  }
  network_interface {
    network    = google_compute_network.bsp_prod_vpc.id
    subnetwork = google_compute_subnetwork.nexus_subnet.id
    access_config { nat_ip = google_compute_address.nexus_static.address }
  }
  service_account {
    email  = google_service_account.nexus_runner.email
    scopes = ["cloud-platform"]
  }
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
  tags = ["web", "ssh-iap"]
}

8.5 Cloud Source Repositories

Google's git hosting, now deprecated: CSR closed to new customers in June 2024 and existing repos are effectively in maintenance. It previously served as a Cloud Build source mirror for inside-VPC pulls. BSP source lives on GitHub; keep it there and trigger Cloud Build from GitHub directly.

8.6 GitHub Actions / GitLab CI integration via WIF

See Section 2.4. The keyless path replaces uploading a service account JSON key to GitHub. The official action is google-github-actions/auth@v2.

📝 Code, GitHub Actions step using WIF
# .github/workflows/deploy.yml
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: "projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/gh-pool/providers/gh-provider"
    service_account: "ci-deployer@bsp-prod.iam.gserviceaccount.com"

- run: gcloud compute ssh nexus-vm --zone=us-central1-a --command="cd /opt/nexus && git pull && sudo systemctl restart nexus"

8.7 Terraform state on GCS

Use a dedicated GCS bucket as Terraform's remote backend. Enable object versioning; the GCS backend supports state locking natively. Restrict the bucket IAM to the CI service account and the humans who need to import state by hand.
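One-time setup for the state bucket referenced in the Terraform backend block above:

```shell
# Create the state bucket with uniform bucket-level access
gcloud storage buckets create gs://bsp-tfstate \
  --location=us-central1 --uniform-bucket-level-access

# Versioning lets you recover a clobbered or corrupted state file
gcloud storage buckets update gs://bsp-tfstate --versioning
```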

8.8 Deployment patterns for nexus-vm

  1. Today, manual SSH + git pull + systemctl restart. Works for a single operator.
  2. Step up, Cloud Build trigger on push to main runs tests, then SSHes in to deploy. Adds an audit trail.
  3. Mature, build a custom image weekly, swap a MIG of size 1, drain old VM. Adds rollback.

8.9 Rollbacks

Without IaC, rollback is "git revert + redeploy." With snapshots and instance templates, rollback is "revert MIG to template version v(N-1)" which can complete in minutes.

✅ Production checklist, Build & Deploy
🎓 FOR NEW HIRE, deploy paths

Today: SSH in, git pull in /opt/nexus, sudo systemctl restart nexus.service. Always look at journalctl -u nexus.service -n 100 -f after a deploy. Tomorrow: a GitHub Actions workflow does it for you when a PR merges. Either way, the pattern of "pull + restart + watch logs" is the same.


⚡ 9. Serverless LOWER PRIORITY

Serverless on GCP means "you bring code, Google runs it on demand." Lower priority for our single-VM stack today, but the right answer for many things we currently shoehorn into nexus-vm cron.

9.1 Cloud Run

The flagship. Containers (any language, any base image), HTTP and gRPC, scales from 0 to N. Pay for CPU and memory in 100ms increments while serving. Two flavors: services (long-running HTTP) and jobs (run to completion). Concurrency per container is configurable, 1 to 1000, default 80.

| Feature | Detail |
| --- | --- |
| Cold start | ~100ms-2s depending on image size |
| Request timeout | Up to 60 minutes (services), 24 hours (jobs) |
| Memory | 128 MiB to 32 GiB |
| CPU | 1, 2, 4, 8 vCPU |
| Min instances | 0 default; set > 0 to avoid cold starts at a cost |
| VPC connector | Direct VPC egress (preview/GA), or Serverless VPC Access |
| Auth | Public, IAP-protected, or invoker IAM |
📝 Code, deploy a Python Cloud Run service from source
gcloud run deploy nexus-helper \
  --source=. --region=us-central1 \
  --allow-unauthenticated --memory=512Mi --cpu=1 \
  --service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
  --set-secrets=ANTHROPIC_API_KEY=anthropic-key:latest
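For orientation, this is roughly the shape of the service that deploy command would build from source. A stdlib-only sketch, no framework assumed; the `nexus-helper` name and the JSON body are illustrative, and a real service would bind the `$PORT` Cloud Run injects:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Minimal JSON response; a real helper would route on self.path.
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep per-request logging quiet

def make_server(port: int) -> HTTPServer:
    # On Cloud Run you would bind 0.0.0.0 on the injected port, e.g.:
    #   make_server(int(os.environ.get("PORT", "8080"))).serve_forever()
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

The key contract is just "listen on $PORT, answer HTTP"; everything else (framework, routing) is your choice.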

9.2 Cloud Functions Gen 1 vs Gen 2

Gen 2 is Cloud Run under the hood with a function-shaped interface. New code goes to Gen 2 (longer timeouts, larger instances, concurrency > 1, richer event triggers). Gen 1 is legacy and in maintenance mode.

9.3 App Engine Standard vs Flex

For new builds, default to Cloud Run unless you have an existing App Engine app.

9.4 Pub/Sub

Global messaging service that decouples producers and consumers. Two delivery modes: push (Pub/Sub POSTs to your endpoint) and pull (your worker fetches). Delivery is at-least-once by default; exactly-once delivery is an opt-in subscription setting, and message ordering is a separate feature enabled with ordering keys. Schema validation and dead-letter topics are supported. Source: cloud.google.com/pubsub/docs.
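At-least-once delivery means your subscriber can see the same message twice, so handlers must be idempotent. A minimal sketch of the standard defense, dedupe by message ID; in real code the seen-set would live in durable storage (Redis, Firestore), not memory:

```python
from dataclasses import dataclass, field

@dataclass
class IdempotentConsumer:
    """Dedupe at-least-once deliveries by Pub/Sub message_id."""
    seen: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def handle(self, message_id: str, payload: str) -> bool:
        """Return True if work was done, False if this was a redelivery."""
        if message_id in self.seen:
            return False  # duplicate: ack it, but do not reprocess
        self.seen.add(message_id)
        self.processed.append(payload)
        return True
```

Either way the subscriber acks the message; the dedupe only guards the side effect.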

9.5 Cloud Scheduler, Tasks, Workflows

Service | Use
Cloud Scheduler | Cron-as-a-service. Hits HTTP, Pub/Sub, App Engine.
Cloud Tasks | Per-item queue with rate limiting, dispatch retry, delay.
Workflows | YAML state machine for multi-step orchestration.

9.6 Eventarc

Event router that turns audit log events, GCS object writes, Pub/Sub messages, BigQuery jobs, and SaaS webhooks into Cloud Run / GKE invocations. Use for "when an object lands in this bucket, run that handler" without writing glue.

9.7 Cost models

✅ Production checklist, Serverless
🎓 FOR NEW HIRE, when to reach for serverless

Anything that runs on a schedule and finishes in under 10 minutes is a great Cloud Scheduler + Cloud Run job candidate. Anything that responds to events (an upload arrived, a Pub/Sub message landed) is Cloud Run or Functions. Anything that runs continuously and holds state is the VM. We default to "put it on the VM" today, but if you find yourself reaching for cron, ask yourself if Cloud Scheduler is nicer.


🏢 10. Project, Org, Billing MEDIUM PRIORITY

Projects are the unit of cost, quota, and IAM. Organizations are the unit of governance. Billing accounts pay the bills. The wiring matters more than people realize.

10.1 Resource hierarchy

Org → Folder (optional, can nest) → Project → Resources. Most BSP work happens in one project (bsp-prod). Recommended additions: bsp-sandbox for safe experiments, bsp-data for BigQuery and analytics with separate billing visibility.

10.2 Organization policies

Constraints applied at Org/Folder/Project level that bind regardless of what IAM would otherwise allow. Examples:

📝 Code, set an org policy
gcloud resource-manager org-policies set-policy policy.yaml \
  --organization=ORG_ID

# policy.yaml
constraint: constraints/iam.disableServiceAccountKeyCreation
booleanPolicy: { enforced: true }

10.3 Custom org policy constraints

2023+ feature. Define your own constraint in CEL targeting any GCP resource field. Example: "every GCS bucket must be in us-central1." Source: cloud.google.com/resource-manager/docs/organization-policy/custom-constraints.

10.4 Quotas

Per-project, per-region, per-API. Soft limits, increase via console request. Common single-VM ones to know: CPUS per region (caps how far you can resize), IN_USE_ADDRESSES (external IPs), and DISKS_TOTAL_GB / SSD_TOTAL_GB (persistent disk capacity).

10.5 Billing accounts and BigQuery exports

Billing accounts pay one or many projects. Separate the production billing account from the sandbox so a runaway dev VM does not blow the prod budget. Enable BigQuery billing export for accurate per-resource cost analysis.

📝 Code, enable BigQuery billing export
# In the Console: Billing → Billing export → BigQuery export
# Creates a dataset like: PROJECT.billing_export

# Sample query: top 10 services by cost last 30 days
SELECT
  service.description AS service,
  ROUND(SUM(cost), 2) AS cost_usd
FROM `PROJECT.billing_export.gcp_billing_export_v1_BILLING_ID`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service
ORDER BY cost_usd DESC
LIMIT 10;

10.6 Budgets and alerts

Set per-billing-account or per-project budgets with alert thresholds (e.g. 50%, 75%, 90%, 100%, 150% of the budget amount), each triggerable on actual or forecasted spend. Notifications go to email and Pub/Sub. Pub/Sub can trigger automatic remediation (e.g. shut down sandbox VMs).
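A remediation handler boils down to one decision over the budget notification payload. A hedged sketch: the `costAmount` and `budgetAmount` field names follow the budget Pub/Sub notification format, the 90% threshold and the stop-sandbox reaction are our choices, and the actual VM-stopping call is omitted:

```python
import json

def should_stop_sandbox(notification: bytes, threshold: float = 0.9) -> bool:
    """Decide whether to stop sandbox VMs from a budget alert payload.

    notification: the JSON body of a budget Pub/Sub message, which
    includes costAmount (spend so far) and budgetAmount (the budget).
    """
    body = json.loads(notification)
    return body["costAmount"] >= threshold * body["budgetAmount"]
```

Wire this into a small Cloud Run service subscribed to the budget topic; on True, call the Compute API to stop tagged sandbox instances.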

10.7 Asset Inventory and Recommender

10.8 Project metadata: labels and tags

Two distinct concepts: labels are simple key-value metadata on resources, surfaced in the billing export for cost breakdowns; tags (resource tags) are org-level key-values that can carry IAM conditions and org policy scoping. Network tags on VMs are a third, older mechanism used only for firewall rules and routes.

✅ Production checklist, Project & Org
🎓 FOR NEW HIRE, project anatomy

If you ever feel "I want to try this without risk," ask Robert to point you at the sandbox project. Production isolation is real. The first thing in any console session: check the project picker top-left and confirm you are in the right project. The number of incidents caused by being in the wrong project is non-trivial.


🛡️ 11. Security Operations MEDIUM PRIORITY

Security Operations on GCP is the sum of detection (Security Command Center), guardrails (Org Policy, Binary Auth, VPC-SC), and forensics (Audit Logs, Asset Inventory).

11.1 Security Command Center tiers

Tier | Cost | Capabilities
Standard | Free | Findings: Web Security Scanner, Sensitive Action Service, exposed assets
Premium | Per-vCPU + per-bucket pricing | + Event Threat Detection, Container Threat Detection, VM Threat Detection, Posture, Compliance reports (CIS, PCI, HIPAA)
Enterprise | Higher tier | + Mandiant threat intel, MISP integration, SOC features

11.2 Threat detection

11.3 Vulnerability scanning

Container Analysis API scans Artifact Registry images on push. Web Security Scanner runs against your App Engine / Cloud Run / GCE web apps to find OWASP issues. Free tier of SCC includes both at limited frequency.

11.4 Compliance reports

SCC Premium includes pre-built reports against frameworks: CIS GCP Benchmark v1.3, PCI DSS, HIPAA, NIST 800-53, ISO 27001, SOC 2, FedRAMP Moderate/High. Each report shows compliant vs non-compliant resources.

11.5 Binary Authorization and Container Analysis

Binary Auth gates GKE/Cloud Run/Anthos so that only signed, attested container images run. Pair with Container Analysis to require a "no high CVEs" attestation. Out of scope for single-VM nexus-vm but the pattern to know if we move to containers.

11.6 Cloud DLP (Sensitive Data Protection)

Detect and redact PII in text, images, and BigQuery. InfoTypes include US_SSN, EMAIL_ADDRESS, CREDIT_CARD_NUMBER, custom regex. Use during ingest of customer data into BSP analytics: scan a sample with DLP, decide whether the field is allowed.

11.7 Access Transparency and Approval

11.8 Control-plane CMEK

Beyond data CMEK (Section 5.5), control-plane CMEK encrypts metadata about your resources (config, IAM bindings) with your key in some products. Niche, but compliance-relevant.

✅ Production checklist, Security Operations
🎓 FOR NEW HIRE, the security mindset

Default to least privilege. When you write code that needs a permission, grant the smallest predefined role you can find, or build a custom role. Treat every "could this credential leak?" question with paranoia. The cheapest way to get hacked is a leaked SA key in a public repo; the cheapest defense is Workload Identity Federation. Read the SCC findings tab once a week and learn what the org looks like to a defender.


💰 12. Cost HIGH PRIORITY

A nexus-vm-sized stack on GCP is cheap if you watch it, expensive if you don't. The expensive surprises are predictable; the cheap path is small habits.

12.1 Pricing models

12.2 Free tier

Service | Free per month
Compute Engine | 1 e2-micro in us-central1/us-east1/us-west1, 30 GB pd-standard, 1 GB egress to most regions
GCS | 5 GB Standard, 5,000 Class A ops, 50,000 Class B ops, 1 GB egress
Cloud Run | 2M requests, 180,000 vCPU-s, 360,000 GiB-s
Cloud Functions | 2M invocations, 400k GB-s, 200k GHz-s
Cloud Logging | 50 GiB ingest, 30-day retention
Cloud Monitoring | GCP metrics free, plus first 150 MiB of chargeable metrics
Cloud Build | 120 build-minutes/day
Pub/Sub | 10 GiB message delivery
Secret Manager | 6 active secret versions, 10k access ops

12.3 Optimization strategies

  1. Right-size, use Recommender's "Right-size VMs" insight monthly.
  2. Lifecycle GCS, auto-tier to Nearline at 30 days, Coldline at 90, delete or archive at 365.
  3. Schedule sandbox shutdowns, Cloud Scheduler stops dev VMs nights and weekends.
  4. Buy CUDs, when steady state is locked in.
  5. Tune log ingest, exclude noisy log lines via sink filters before they hit storage.
  6. Avoid cross-region egress, keep workloads in the same region as their data.

12.4 Billing exports to BigQuery

See Section 10.5. The detailed export includes per-resource cost broken down by SKU.

12.5 Common cost surprises

Surprise | Why | Mitigation
Egress charges | Cross-region or to-internet egress is $0.01-$0.12/GB | Keep data in-region, use Cloud CDN, compress responses
Cloud NAT data processing | $0.045/GB processed + $0.0045/hr per IP | Use Private Google Access for googleapis.com endpoints
Log ingest | $0.50/GiB beyond 50 GiB | Drop noisy log lines via sink filter exclusions
Snapshot accumulation | Snapshots compound until you delete | Retention on the snapshot schedule (e.g. keep 7 daily, 4 weekly, 12 monthly)
Idle static IP | $0.005/hr while not attached to a running VM | Release unused IPs
Cloud Logging rehydration | Fetching logs older than retention is expensive | Stream to GCS or BQ before the retention cliff
Cloud SQL HA when not needed | 2x cost | Disable HA on dev/sandbox

12.6 nexus-vm specific cost analysis

Rough monthly estimate, assuming n2-standard-2 (2 vCPU, 8 GB RAM) running 24/7 in us-central1, 50 GB pd-balanced, 1 static IP, ~5 GB Cloud Logging, ~50 GB GCS Standard for backups:

Item | Detail | $/month
n2-standard-2 vCPU | 2 vCPU x 730 hrs x $0.0317 | ~$46.30
n2-standard-2 RAM | 8 GiB x 730 hrs x $0.00425 | ~$24.83
Sustained use discount | ~10% off N2 (auto) | -$7.10
50 GB pd-balanced boot disk | 50 x $0.10 | ~$5.00
Static external IP (in use) | 730 hrs x $0.005 (in-use IPv4 is billed since Feb 2024) | ~$3.65
Egress to internet | ~10 GB x $0.085 (assumes US-to-most) | ~$0.85
Cloud Logging | 5 GB ingest, free under 50 GiB | $0.00
GCS Standard backups | 50 GB x $0.020 | ~$1.00
Snapshot storage | ~30 GB compressed x $0.026 | ~$0.78
Secret Manager | ~10 secrets, 1k ops/mo | ~$0.06
Total estimate | List minus SUD | ~$75.37
💡 Insight, where a 1-yr CUD pays back

A 1-year resource-based CUD on 2 vCPU + 8 GB RAM in us-central1 saves ~37% off N2 list. That is roughly $26/mo savings on a $70 base, paying back inside the first month and locking in the rate for 12 months. Caveat: you keep paying for the committed amount even if you delete the VM.
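The CUD arithmetic is worth checking explicitly, because SUD and CUD do not stack: a ~37% CUD is ~$26/mo off list, but only ~$19/mo beyond the automatic SUD you already get. A back-of-envelope sketch, rates as assumed in the table above; verify against the live pricing page:

```python
# Compute-only comparison for 2 vCPU + 8 GiB N2 in us-central1.
HOURS = 730
VCPU_RATE, RAM_RATE = 0.0317, 0.00425  # $/vCPU-hr, $/GiB-hr (assumed list)

compute_list = 2 * HOURS * VCPU_RATE + 8 * HOURS * RAM_RATE
sud_net = compute_list * (1 - 0.10)   # ~10% sustained use discount, automatic
cud_net = compute_list * (1 - 0.37)   # ~37% off with a 1-yr resource CUD

saving_vs_list = compute_list - cud_net
saving_vs_sud = sud_net - cud_net     # the true incremental benefit
print(round(compute_list, 2), round(saving_vs_list, 2), round(saving_vs_sud, 2))
```

The "saving vs SUD" number is the one that should justify the 12-month commitment.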

⚠️ Gotcha, egress is the most-asked-about line

If your monthly bill jumps by $50 unexpectedly, look at egress first. A misconfigured backup that pulls 600 GB to a non-Google destination is roughly $50 of egress out of nowhere. Run the BQ billing query grouped by SKU and filter sku.description on %egress%.

✅ Production checklist, Cost
🎓 FOR NEW HIRE, cost discipline in 90 seconds

Before you create anything, ask: how much does this cost per month if I forget to delete it? GCP bills by the second; a forgotten dev VM at $50/month is $1.65/day. The dashboard at Console → Billing → Reports tells you in <30 seconds. Bookmark it. Look at it weekly.


🖥️ 13. Console UI LOWER PRIORITY

The web console at console.cloud.google.com is mostly self-explanatory, but a handful of patterns save real time.

13.1 URL structure

Every page has a deep link. https://console.cloud.google.com/compute/instances?project=PROJECT_ID jumps directly to the instance list for a project. Bookmark these for the resources you visit daily.

13.2 Cloud Shell

Click the >_ icon top-right to open a Linux shell in your browser, no install. Pre-loaded with gcloud, kubectl, terraform, docker, python, node, vim. $HOME persists 5 GB. Sessions expire after 60 minutes idle.

13.3 Activity feed

Console → Home → Activity. A timeline of every Admin Activity audit event in the project. Useful first stop for "who changed what."

13.4 Search bar

Top-bar search auto-completes resource names across services. Search for nexus-vm and you get the GCE instance, related disks, snapshots, and any log entries that mention it. Faster than navigating menus.

13.5 Dashboard customization

Console → Home → Dashboard. Add/remove tiles. Pin Monitoring dashboards. Useful for an at-a-glance ops view.

13.6 Mobile app

"Cloud Console" app for iOS/Android. Useful for: viewing alerts, restarting a VM in a pinch, checking the bill on a Sunday morning. Do not run major IaC changes from a phone.

🎓 FOR NEW HIRE, console productivity

Three habits: (1) confirm the project picker every time you open a new tab, (2) press / to jump to search, (3) star resources to pin them in the navigation drawer. The console is fine for exploration; for any change that needs an audit trail, prefer gcloud or Terraform so the change is reviewable.


🔍 14. Troubleshooting HIGH PRIORITY

When something is on fire, a runbook beats panic. This section is the runbook for the failure modes you will actually hit on nexus-vm.

14.1 Cloud Debugger deprecation

🔥 Recency, Cloud Debugger removed

Cloud Debugger was sunset in May 2023. Replacement: Cloud Profiler for performance, plus modern OpenTelemetry-based debugging in your IDE. If you find docs referencing Cloud Debugger, ignore them.

14.2 Connectivity Tests and network path introspection

Console → Network Intelligence → Connectivity Tests. Define a source (VM, IP, internet) and destination, run a simulated path. Tells you which firewall rule, route, or peering blocked the traffic. Saves hours of guessing.

📝 Code, run a connectivity test from CLI
gcloud network-management connectivity-tests create nexus-from-cf \
  --source-ip-address=104.16.0.1 \
  --destination-instance=projects/PROJECT_ID/zones/us-central1-a/instances/nexus-vm \
  --destination-port=443 --protocol=TCP

gcloud network-management connectivity-tests describe nexus-from-cf

14.3 Troubleshooter wizards

The console has wizards for: IAM "why can't user X do Y", VPC "why can't VM A reach B", LB "why is health check failing". Run them before guessing.

14.4 Quotas page

IAM & Admin → Quotas & System Limits. When an API call returns 429, look here first. Filter to the service whose quota you suspect.

14.5 Error code reference

Code | Meaning | First check
400 | Bad request | Validate request body, region, zone names
401 | Unauthenticated | ADC discovery, expired token, wrong gcloud config
403 | Permission denied | Missing IAM role; check exact permission string in error
404 | Not found | Resource name typo, wrong project, wrong region
409 | Conflict | Concurrent modification, wait and retry
429 | Too many requests | Quota or rate limit; check Quotas page
500 | Internal error | Retry with backoff; check status.cloud.google.com
503 | Service unavailable | Regional outage; check status page; retry
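The 429/500/503 rows share one remedy: retry with exponential backoff and jitter. A generic sketch; the retryable set mirrors the table, the call is abstracted to a status-returning function, and sleep is injected so the logic is testable:

```python
import random
import time
from typing import Callable

RETRYABLE = {429, 500, 503}

def call_with_backoff(fn: Callable[[], int], max_tries: int = 5,
                      base: float = 0.5, sleep=time.sleep) -> int:
    """Call fn until it returns a non-retryable status or tries run out.

    fn returns an HTTP-ish status code; 429/500/503 trigger a retry
    after an exponentially growing wait with full jitter.
    """
    status = 0
    for attempt in range(max_tries):
        status = fn()
        if status not in RETRYABLE:
            return status
        sleep(random.uniform(0, base * (2 ** attempt)))
    return status  # still retryable after max_tries: surface it
```

Client libraries do this for you; hand-rolled gcloud or REST scripts usually do not, which is why a cron job that works 99% of the time flakes on the 1% of 503s.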

14.6 IAM troubleshooter step-by-step

  1. Copy the exact permission string from the 403 (e.g. compute.instances.start).
  2. Open Console → IAM & Admin → Troubleshoot.
  3. Enter the user/SA email and the resource (instance URL).
  4. Click "Check access." It returns the inherited bindings and the missing permission.
  5. Grant the smallest predefined role (look it up in cloud.google.com/iam/docs/understanding-roles) that contains the permission.
  6. Re-run the failing call. If still 403, check Org Policy and Deny policies.
✅ Production checklist, Troubleshooting
🎓 FOR NEW HIRE, the calmness algorithm

(1) Read the error literally. (2) Map it to a section in this doc. (3) Run the troubleshooter / connectivity test before guessing. (4) Never paste your fix into production until you can articulate the failure mode in one sentence. (5) When stuck after 30 minutes, ask Robert. Cost of asking: 0. Cost of cascading the wrong fix: hours.


🎯 15. Integration Points with nexus-vm Stack HIGHEST PRIORITY

The longest section by design. Everything above maps to abstractions; this maps to the actual production stack at 34.55.179.122 and the systems it touches. If a future Robert reads only one section to recover from a disaster, this is the one.

15.1 The current stack, in one screen

Layer | Component | State today | Where it lives
DNS & Edge | Cloudflare zone a87220882ed631dd4dfb | Production | Cloudflare
Compute | GCE VM nexus-vm | Production, single VM | us-central1-a, IP 34.55.179.122
Filesystem | /opt/nexus Python framework | Production | nexus-vm boot disk
HTTP service | Context Harness on localhost:8765 | Production | nexus-vm, systemd-managed
RAG store | Zeus, 19,679 chunks, text-embedding-3-small | Production | nexus-vm filesystem + index
Web UI | morpheus.callbrightside.com | Production | nexus-vm + Cloudflare
WP integration | claude-api → bricks.callbrightside.com WP REST | Production | Hostinger u227696829
SSH | ~/.ssh/google_compute_engine + dovew user | Production | nexus-vm metadata
Secrets (today) | OS env vars / .env files on VM | To migrate | To Secret Manager
Backups (today) | None automated | To create | To GCS + snapshot policy

15.2 SSH access patterns, where we are vs where we should be

Today. Robert SSHes via ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122. The public key is in instance metadata under ssh-keys. This works but has three weaknesses: (a) the VM has a public IP that any bot can scan, (b) revoking access requires editing metadata, (c) audit logs show dovew not robert.dove@callbrightside.com.

Bulletproof target. No public IP. Access via IAP TCP forwarding (Section 1.9.3) gated by OS Login (Section 1.9.1). Audit logs show the Google identity. Revoking access is one IAM binding removal.

📝 Code, the migration plan
# 1. Grant Robert OS Login + IAP roles
gcloud projects add-iam-policy-binding bsp-prod \
  --member=user:robert.dove@callbrightside.com --role=roles/compute.osAdminLogin
gcloud projects add-iam-policy-binding bsp-prod \
  --member=user:robert.dove@callbrightside.com --role=roles/iap.tunnelResourceAccessor

# 2. Add the IAP firewall rule
gcloud compute firewall-rules create allow-ssh-iap \
  --network=bsp-prod-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=35.235.240.0/20 --target-tags=ssh-iap

# 3. Test IAP works while public IP is still attached
gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap

# 4. Enable OS Login per-instance
gcloud compute instances add-metadata nexus-vm \
  --zone=us-central1-a --metadata enable-oslogin=TRUE

# 5. After 7 days of stable IAP-only operation, drop the public IP
gcloud compute instances delete-access-config nexus-vm \
  --zone=us-central1-a --access-config-name="External NAT"
⚠️ Gotcha, do not drop the public IP without first wiring up the LB

If the VM goes private, public traffic for morpheus.callbrightside.com cannot reach it directly. You need a global Application Load Balancer with the VM as a backend, the LB has the public IP, and the VM only accepts traffic from the LB plus IAP. Plan the LB before pulling the IP.
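When auditing firewall rules or access logs, it helps to verify that an SSH source really falls inside Google's IAP forwarding block, 35.235.240.0/20, the same range used in the allow-ssh-iap rule above. A quick stdlib check:

```python
from ipaddress import ip_address, ip_network

# Google's published source range for IAP TCP forwarding.
IAP_RANGE = ip_network("35.235.240.0/20")

def is_iap_source(ip: str) -> bool:
    """True if a connecting IP falls inside the IAP forwarding range."""
    return ip_address(ip) in IAP_RANGE
```

Useful in a log-scanning script: anything hitting :22 from outside this range after the migration is a probe, not a teammate.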

15.3 Service accounts for nexus-vm and external integration

Recommended SA design:

📝 Code, attach a fresh SA to nexus-vm
# Create the SA
gcloud iam service-accounts create nexus-runner --display-name="Nexus VM Runner"

# Grant the secrets it needs
for secret in ANTHROPIC_API_KEY CLOUDFLARE_API_TOKEN BRICKS_WP_APP_PASSWORD VAPI_API_KEY OPENAI_API_KEY; do
  gcloud secrets add-iam-policy-binding $secret \
    --member=serviceAccount:nexus-runner@bsp-prod.iam.gserviceaccount.com \
    --role=roles/secretmanager.secretAccessor
done

# Switch the VM (requires VM stop)
gcloud compute instances stop nexus-vm --zone=us-central1-a
gcloud compute instances set-service-account nexus-vm \
  --zone=us-central1-a \
  --service-account=nexus-runner@bsp-prod.iam.gserviceaccount.com \
  --scopes=cloud-platform
gcloud compute instances start nexus-vm --zone=us-central1-a

15.4 GCS as the backup destination for /opt/nexus

Pick or create a regional bucket in us-central1: gs://bsp-nexus-backups. Versioning on, lifecycle to Nearline at 30 days, Coldline at 90, delete at 365 (Section 4.2). UBLA on. Restrict IAM to the nexus-runner SA + a humans-only audit role.
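The lifecycle tiers above can be sanity-checked in plain Python before encoding them in bucket config. A sketch of the intended policy, age thresholds as stated in this doc; the function names and return strings are illustrative, not GCS API values:

```python
def lifecycle_action(age_days: int) -> str:
    """Map a backup object's age to the tier/action our bucket policy intends:
    Standard until day 30, Nearline to day 90, Coldline to day 365, then delete."""
    if age_days >= 365:
        return "delete"
    if age_days >= 90:
        return "coldline"
    if age_days >= 30:
        return "nearline"
    return "standard"
```

Handy as a unit test next to the Terraform that manages the bucket: if someone edits a threshold in one place and not the other, the test catches the drift.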

📝 Code, daily /opt/nexus backup script
#!/bin/bash
# /opt/nexus/scripts/backup_daily.sh
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
ARCHIVE=/tmp/nexus-${DATE}.tar.zst

tar --zstd -cf "$ARCHIVE" \
  --exclude='/opt/nexus/.git' \
  --exclude='__pycache__' \
  --exclude='*.pyc' \
  /opt/nexus

gcloud storage cp "$ARCHIVE" gs://bsp-nexus-backups/daily/${DATE}.tar.zst
rm "$ARCHIVE"

# Optional: write a sentinel for the latest successful run
echo "$DATE" | gcloud storage cp - gs://bsp-nexus-backups/_latest.txt

# Cron entry
# 0 3 * * * /opt/nexus/scripts/backup_daily.sh >> /var/log/nexus-backup.log 2>&1

15.5 Cloud SQL evaluation for the WP database

The WordPress staging at bricks.callbrightside.com runs on Hostinger's MySQL. If we ever decide to bring WP on-platform (full GCP), the path is:

  1. Create a Cloud SQL MySQL 8.0 instance, 2 vCPU, 8 GB RAM, 100 GB SSD, HA enabled, automated backups, PITR enabled, maintenance window Saturday 03:00 UTC.
  2. Migrate via Database Migration Service (DMS). Set up continuous replication, validate, cutover.
  3. Update wp-config.php on a GCE-hosted PHP setup or App Engine to point at the Cloud SQL Auth Proxy socket.
  4. Use Secret Manager for the DB password.
  5. Take Cloud SQL backups daily, export weekly to GCS for cross-region DR.

Estimated incremental cost: ~$120/month for the HA Cloud SQL plus ~$5 storage. Decision deferred until WP scale or compliance forces it.

15.6 GCE firewall rules and hardening

The minimum firewall set for nexus-vm in production:

Name | Direction | Source | Ports | Targets
allow-ssh-iap | INGRESS | 35.235.240.0/20 | tcp:22 | tag ssh-iap
allow-https-cf | INGRESS | Cloudflare CIDR list | tcp:443 | tag web
allow-internal | INGRESS | 10.10.0.0/24 | all | VPC-internal
deny-all-ingress | INGRESS | 0.0.0.0/0 | all | (catch-all, priority 65534)

Additional OS-level hardening: ufw or nftables mirroring the GCP firewall, fail2ban for SSH, automatic unattended upgrades enabled, root SSH disabled, password auth disabled, public key only.

📝 Code, baseline OS hardening on Debian/Ubuntu
sudo apt update && sudo apt install -y unattended-upgrades fail2ban
sudo dpkg-reconfigure -plow unattended-upgrades

# /etc/ssh/sshd_config tweaks
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
ClientAliveInterval 300
ClientAliveCountMax 2

sudo systemctl reload ssh

15.7 Static IP status and snapshots schedule

📝 Code, verify nexus-vm IP is static, then create a snapshot policy
# Check static IP
gcloud compute addresses list --filter="address=34.55.179.122"

# Promote ephemeral to static if needed
gcloud compute addresses create nexus-vm-static \
  --addresses=34.55.179.122 --region=us-central1

# Create a daily snapshot schedule with 14-day retention. Note: a single
# schedule has one cadence; for 7 daily / 4 weekly / 12 monthly, layer
# separate policies or prune manually.
gcloud compute resource-policies create snapshot-schedule nexus-daily \
  --region=us-central1 \
  --max-retention-days=14 \
  --start-time=07:00 --daily-schedule \
  --on-source-disk-delete=keep-auto-snapshots \
  --storage-location=us

# Attach to the boot disk
gcloud compute disks add-resource-policies nexus-vm \
  --zone=us-central1-a --resource-policies=nexus-daily
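The schedule above enforces max-retention-days=14 automatically, but manually created snapshots (the `manual-YYYYMMDD` ones from Appendix A) accumulate forever. A pruning sketch that mirrors the same policy; timestamps assumed UTC, and the actual `gcloud compute snapshots delete` call is left to the caller:

```python
from datetime import datetime, timedelta

def snapshots_to_delete(created: dict[str, datetime], now: datetime,
                        max_retention_days: int = 14) -> list[str]:
    """Return snapshot names older than the retention window, sorted."""
    cutoff = now - timedelta(days=max_retention_days)
    return sorted(name for name, ts in created.items() if ts < cutoff)
```

Run it weekly from cron, feeding it the names and creationTimestamp values from `gcloud compute snapshots list --format=json`.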

15.8 Load balancer + GCE backend for production scale

If we add a global Application Load Balancer for morpheus.callbrightside.com:

Benefits: TLS terminates on Google's edge, the nexus-vm public IP can be dropped, and scaling out to a MIG of 2 later needs no architecture rework.

15.9 Backup strategies, layered

Layer | Frequency | RPO | RTO | Method
Boot disk snapshots | Daily | 24h | 15-30 min | Snapshot schedule (Section 15.7)
/opt/nexus tarball | Daily | 24h | 5-10 min | Cron + GCS (Section 15.4)
Git remote | On every push | Minutes | 1-2 min | GitHub origin
Secrets | On rotation | ~immediate | 1 min | Secret Manager versions
External APIs | N/A | N/A | N/A | Service-side responsibility (Hostinger, Cloudflare)

15.10 Monitoring nexus-vm via Ops Agent

See Section 6.7 for the Ops Agent install. The minimum alerts to wire up:

  1. VM unreachable, uptime check on morpheus.callbrightside.com failing for 5 minutes.
  2. Boot disk above 80% used.
  3. Memory above 90% sustained for 10 minutes.
  4. nexus.service or context-harness.service not running.

15.11 Secret Manager rotation for app secrets

Critical secrets to rotate on schedule: ANTHROPIC_API_KEY, OPENAI_API_KEY, CLOUDFLARE_API_TOKEN, VAPI_API_KEY, BRICKS_WP_APP_PASSWORD.

Rotation pattern: add new version, update consumer code to read latest, monitor for ~24 hours, disable old version (do not destroy yet), verify, destroy after 7 days.
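That rotation pattern is a small state machine over secret versions. A sketch whose state names mirror Secret Manager's enabled/disabled/destroyed version states; the class itself is illustrative bookkeeping, not the Secret Manager API:

```python
from dataclasses import dataclass, field

@dataclass
class Secret:
    # version number -> "enabled" | "disabled" | "destroyed"
    versions: dict = field(default_factory=dict)
    latest: int = 0

    def add_version(self) -> int:
        self.latest += 1
        self.versions[self.latest] = "enabled"
        return self.latest

    def rotate(self) -> int:
        """Add a new version, then disable (not destroy) the previous one."""
        prev = self.latest
        new = self.add_version()
        if prev:
            self.versions[prev] = "disabled"  # keep for the rollback window
        return new

    def destroy_old(self, version: int) -> None:
        # Enforce the "disable, verify, then destroy" ordering from the text.
        assert self.versions[version] == "disabled", "disable before destroy"
        self.versions[version] = "destroyed"
```

The assert encodes the one rule that saves you during a bad rotation: a disabled version can be re-enabled, a destroyed one cannot.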

15.12 Disaster recovery, nexus-vm dies, how to rebuild

Assume the VM is gone (deleted, host failure, regional outage). Recovery procedure, ordered:

  1. Confirm in the console that the instance is in fact gone, not just stopped (gcloud compute instances list --filter=name=nexus-vm).
  2. If the instance was deleted but the boot disk survived (--keep-disks=boot was passed at delete time; by default the boot disk is deleted with the instance), recreate the instance on it: gcloud compute instances create nexus-vm --zone=us-central1-a --disk=name=nexus-vm,boot=yes with the prior SA, tags, and network.
  3. If the boot disk is gone, restore from the latest snapshot: gcloud compute disks create nexus-vm --source-snapshot=nexus-daily-LATEST --zone=us-central1-a then create instance pointing at it.
  4. Reattach the static IP 34.55.179.122 via --address=nexus-vm-static.
  5. Validate that /opt/nexus is intact, run systemctl status nexus.service context-harness.service.
  6. If the entire region is down, restore in a different zone of us-central1; if all of us-central1 is down, the snapshot is multi-regional so you can build in us-east1 (different external IP, update Cloudflare DNS).
  7. Smoke test: curl https://morpheus.callbrightside.com, run a Zeus search, check Context Harness /healthz.
  8. Rotate the Anthropic, Cloudflare, and BRICKS_WP_APP_PASSWORD secrets just in case the disaster was a credential compromise.
📝 Code, the fast rebuild script (run from any machine with gcloud)
#!/bin/bash
set -euo pipefail

PROJECT=bsp-prod
ZONE=us-central1-a
VM=nexus-vm
SA=nexus-runner@${PROJECT}.iam.gserviceaccount.com

# 1. Find latest snapshot
LATEST=$(gcloud compute snapshots list \
  --filter="name~nexus-daily AND status=READY" \
  --sort-by=~creationTimestamp --limit=1 --format="value(name)")
echo "Restoring from snapshot: $LATEST"

# 2. Recreate boot disk
gcloud compute disks create $VM \
  --source-snapshot=$LATEST --zone=$ZONE --type=pd-balanced

# 3. Create instance from existing disk
gcloud compute instances create $VM \
  --zone=$ZONE --machine-type=n2-standard-2 \
  --disk=name=${VM},boot=yes,auto-delete=yes \
  --service-account=$SA --scopes=cloud-platform \
  --address=nexus-vm-static \
  --tags=web,ssh-iap \
  --shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring

# 4. Wait for boot, smoke test
sleep 60
gcloud compute ssh $VM --zone=$ZONE --tunnel-through-iap --command="systemctl status nexus.service"

15.13 Complete nexus-vm production architecture

[Diagram] nexus-vm production architecture, all external connections. User traffic: browsers (Robert, team, public) reach Cloudflare (DNS + CDN + WAF, callbrightside.com zone) over TLS 1.3, then the GCP edge at 34.55.179.122 through the VPC firewall + IAP, landing on GCE VM nexus-vm in us-central1-a (/opt/nexus Python automation, Context Harness :8765 under systemd, Zeus RAG with 19,679 chunks, morpheus.callbrightside.com). GCP service plane: Secret Manager (5 app secrets), GCS bsp-nexus-backups (daily tarball + lifecycle), snapshot schedule (14d retention, multi-regional), Cloud Logging + Monitoring via Ops Agent, IAM nexus-runner SA (least privilege). SaaS API plane: Hostinger u227696829 (WP), Anthropic API (Claude Opus 4.7), Vapi (Daniel AI, (913) 963-9817), OpenAI embeddings-3-small, Cloudflare API (DNS + cache purge), ServiceTitan and BigSale data sources. Solid lines = user/HTTP traffic; dashed = GCP service plane; dotted = SaaS API plane.
Figure 15.1, complete nexus-vm production architecture showing every external dependency and GCP service.
✅ Production checklist, nexus-vm integration
🎓 FOR NEW HIRE, the nexus-vm onboarding lap

Day 1: SSH to nexus-vm, cd /opt/nexus, run ls, git status, systemctl status nexus.service context-harness.service. Day 2: open morpheus.callbrightside.com and click around, run a Zeus search via the harness. Day 3: read this section end-to-end. Day 5: shadow Robert through the daily ops loop. Week 2: own the daily backup verification (does the GCS bucket have today's tarball). Week 3: own a non-critical change, write a Master History entry. Month 2: lead a DR drill end-to-end with Robert observing.


Appendices

Appendix A. gcloud CLI cheatsheet for single-VM ops

Action | Command
Configure account/project | gcloud auth login · gcloud config set project bsp-prod · gcloud config set compute/zone us-central1-a
List VMs | gcloud compute instances list
Describe nexus-vm | gcloud compute instances describe nexus-vm --zone=us-central1-a
SSH | gcloud compute ssh nexus-vm --zone=us-central1-a [--tunnel-through-iap]
Stop / start | gcloud compute instances stop nexus-vm · gcloud compute instances start nexus-vm
Resize | gcloud compute instances set-machine-type nexus-vm --machine-type=n2-standard-4 (stopped VM)
Resize disk | gcloud compute disks resize nexus-vm --size=100GB then resize2fs
Snapshot | gcloud compute disks snapshot nexus-vm --snapshot-names=manual-$(date +%Y%m%d)
List snapshots | gcloud compute snapshots list --filter="name~nexus"
List firewall rules | gcloud compute firewall-rules list
Add firewall rule | gcloud compute firewall-rules create NAME --allow=tcp:443 --source-ranges=...
List addresses | gcloud compute addresses list
Reserve static IP | gcloud compute addresses create NAME --addresses=IP --region=us-central1
Read serial port | gcloud compute instances get-serial-port-output nexus-vm
List service accounts | gcloud iam service-accounts list
Get IAM policy on project | gcloud projects get-iam-policy bsp-prod
Add IAM binding | gcloud projects add-iam-policy-binding bsp-prod --member=... --role=...
List secrets | gcloud secrets list
Read latest secret | gcloud secrets versions access latest --secret=NAME
Add secret version | echo -n "VAL" | gcloud secrets versions add NAME --data-file=-
Tail logs | gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc
Stream logs | gcloud logging tail 'resource.type="gce_instance"'
List buckets | gcloud storage buckets list
Copy to GCS | gcloud storage cp file.tar.gz gs://bsp-nexus-backups/
Download from GCS | gcloud storage cp gs://bsp-nexus-backups/latest.tar.gz .
List enabled APIs | gcloud services list --enabled
Run a connectivity test | gcloud network-management connectivity-tests create ...
Show quotas | gcloud compute regions describe us-central1 --format='value(quotas)'
Billing info | gcloud billing projects describe bsp-prod

Appendix B. IAM roles → permissions matrix (single-VM relevant)

Role | Key permissions | Use
roles/compute.osLogin | compute.instances.osLogin | SSH as a regular user via OS Login
roles/compute.osAdminLogin | compute.instances.osAdminLogin | SSH as sudo via OS Login
roles/iap.tunnelResourceAccessor | iap.tunnelInstances.accessViaIAP | SSH through IAP tunnel
roles/compute.instanceAdmin.v1 | compute.instances.* (start, stop, delete, set-machine-type) | Manage VM lifecycle
roles/compute.storageAdmin | compute.disks.*, compute.snapshots.* | Disks and snapshots
roles/compute.networkAdmin | compute.networks.*, compute.firewalls.*, compute.routers.* | VPC and firewalls
roles/storage.objectViewer | storage.objects.get, list | Read GCS objects
roles/storage.objectAdmin | storage.objects.* | Read/write GCS objects (bucket-scope)
roles/storage.admin | storage.* (incl. buckets) | Bucket admin, dangerous in prod
roles/secretmanager.secretAccessor | secretmanager.versions.access | Read latest/specific version
roles/secretmanager.secretVersionManager | secretmanager.versions.add, disable | Rotate secrets
roles/secretmanager.admin | secretmanager.* | Create/delete secrets
roles/cloudkms.cryptoKeyEncrypterDecrypter | cloudkms.cryptoKeyVersions.useToEncrypt/Decrypt | Use a key
roles/logging.logWriter | logging.logEntries.create | Write log entries
roles/logging.viewer | logging.logEntries.list | Read logs
roles/monitoring.metricWriter | monitoring.timeSeries.create | Write custom metrics
roles/monitoring.viewer | monitoring.* read | View dashboards
roles/monitoring.editor | monitoring.* write | Edit dashboards, alerts
roles/cloudtrace.agent | cloudtrace.traces.patch | Send traces
roles/errorreporting.writer | errorreporting.errorEvents.create | Send error events
roles/cloudsql.client | cloudsql.instances.connect | Connect through Cloud SQL Auth Proxy
roles/cloudbuild.builds.editor | cloudbuild.builds.* | Run Cloud Build
roles/iam.serviceAccountTokenCreator | iam.serviceAccounts.signBlob, getAccessToken | Sign on behalf of an SA
roles/iam.workloadIdentityUser | iam.serviceAccounts.getOpenIdToken | WIF target binding
roles/run.invoker | run.routes.invoke | Call a private Cloud Run service
roles/owner | everything | Avoid in production
roles/editor | almost everything except IAM | Avoid in production
roles/viewer | read most resources | OK for read-only humans

Appendix C. Troubleshooting decision trees

C.1 VM unreachable via SSH

SSH timeout:
1. Is the VM in RUNNING state? If not, start it and retry.
2. Can you reach the IP at all? If not, check the serial console for a dead sshd, an OOM kill, or a full filesystem.
3. Does a firewall rule allow :22? Run a Connectivity Test from your IP to confirm.
4. Is OS Login on and missing a grant? Grant osLogin / osAdminLogin.
5. Going through IAP? Grant tunnelResourceAccessor.
Figure C.1, "VM unreachable via SSH" decision tree.
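The same tree can be encoded as ordered checks for a runbook script. The boolean flags and remediation strings below are illustrative; in practice each flag comes from gcloud compute instances describe, the serial console, or a Connectivity Test:

```python
def ssh_triage(running, ip_reachable, fw_allows_22, os_login_granted, iap_granted):
    """Walk the C.1 decision tree and return the first remediation step.

    Each flag records what you have already verified to be true.
    """
    if not running:
        return "start the VM, then retry"
    if not ip_reachable:
        return "check serial console: sshd down, OOM kill, or full filesystem"
    if not fw_allows_22:
        return "add a firewall rule for :22 and rerun the Connectivity Test"
    if not os_login_granted:
        return "grant roles/compute.osLogin or osAdminLogin"
    if not iap_granted:
        return "grant roles/iap.tunnelResourceAccessor"
    return "escalate: all known causes ruled out"

print(ssh_triage(True, True, False, True, True))
```

The ordering matters: cheap, high-probability checks first, IAM grants last, the same priority the tree itself expresses.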

C.2 API returning 403

HTTP 403 from an API:
1. Is the API enabled? If not, run gcloud services enable X.
2. Is the caller identity what you think it is? Check ADC and gcloud config.
3. Is an Org Policy or Deny policy in the way? Console → Org policies / Deny.
4. Still stuck? Run the IAM Troubleshooter.
Figure C.2, "API returning 403" decision tree.

C.3 Costs spiking

Bill higher than expected:
1. Group the BigQuery billing export by SKU and find the largest SKU rows.
2. Egress dominating? Co-locate resources or put CDN/PGA in front.
3. Logging ingest dominating? Add sink filter exclusions.
4. Snapshots dominating? Tighten the retention policy.
5. Either way, set a budget with an alert at 75% so the next spike surfaces sooner.
Figure C.3, "Costs spiking" decision tree.
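Step 1 of that tree, grouping the billing export by SKU and ranking, is a few lines once the export rows are in hand. The rows below are made-up sample data; a real run would query the BigQuery billing export table instead:

```python
from collections import defaultdict

# Hypothetical rows shaped like the billing export: (sku_description, cost_usd).
rows = [
    ("N2 Instance Core running in Americas", 46.30),
    ("Network Internet Egress from Americas", 31.20),
    ("N2 Instance Ram running in Americas", 24.83),
    ("Network Internet Egress from Americas", 12.50),
    ("Storage PD Snapshot", 0.78),
]

totals = defaultdict(float)
for sku, cost in rows:
    totals[sku] += cost

# Largest SKUs first; the top row tells you which branch of the tree to take.
for sku, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cost:8.2f}  {sku}")
```

In this sample the two egress rows sum past the RAM SKU, which is exactly the kind of aggregation the raw export hides.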

C.4 Logs not appearing

Expected logs missing:
1. Is the Ops Agent running? systemctl status google-cloud-ops-agent.
2. Does the SA have logging.logWriter? Grant the role to the nexus-runner SA.
3. Is a sink exclusion dropping entries? Inspect the sink filter expressions on the Logs Router page.
Figure C.4, "Logs not appearing" decision tree.

Appendix D. Cost calculator examples (single-VM scenarios)

D.1 Baseline nexus-vm, n2-standard-2, 50 GB pd-balanced, 24/7

| Component | Quantity | Unit price | Monthly |
| --- | --- | --- | --- |
| n2 vCPU (us-central1) | 2 x 730 hr | $0.0317/hr | $46.30 |
| n2 RAM (us-central1) | 8 GiB x 730 hr | $0.00425/hr | $24.83 |
| SUD ~10% (auto) | applied | - | -$7.10 |
| pd-balanced | 50 GB | $0.10/GB-mo | $5.00 |
| Static IP (in-use) | 730 hr | free while in-use | $0.00 |
| Egress (light) | 10 GB | $0.085/GB | $0.85 |
| GCS backup (Standard) | 50 GB | $0.020/GB-mo | $1.00 |
| Snapshot storage | ~30 GB | $0.026/GB-mo | $0.78 |
| Logging | 5 GiB ingest | free under 50 GiB | $0.00 |
| Secret Manager | 10 active, 1k ops | ~$0.06/secret-mo | $0.60 |
| Subtotal | | | $72.26 |
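The subtotal is just the sum of the rounded line items (the SUD row is a credit). A quick sanity check, with the prices hard-coded from table D.1; list prices drift, so reverify against the current pricing page before trusting the dollars:

```python
# Line items from table D.1, in USD/month, rounded as printed.
line_items = {
    "n2 vCPU (2 x 730 hr @ $0.0317/hr)": 46.30,
    "n2 RAM (8 GiB x 730 hr @ $0.00425/hr)": 24.83,
    "sustained use discount (~10%, automatic)": -7.10,
    "pd-balanced 50 GB": 5.00,
    "static IP (free while attached)": 0.00,
    "egress 10 GB": 0.85,
    "GCS backup 50 GB Standard": 1.00,
    "snapshot storage ~30 GB": 0.78,
    "logging (under the free tier)": 0.00,
    "Secret Manager (10 active secrets)": 0.60,
}
subtotal = round(sum(line_items.values()), 2)
print(f"subtotal: ${subtotal}/mo")
```

Re-running this after any machine-type or disk change is a thirty-second way to catch a surprise before the bill does.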

D.2 Baseline + 1-yr CUD on N2 (2 vCPU + 8 GiB)

| Component | Effect | Delta |
| --- | --- | --- |
| 1-yr CUD on n2 vCPU + RAM | ~37% off | -$22.70 |
| SUD does not stack | replaces SUD | +$7.10 |
| Adjusted total | | ~$56.66/mo |
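The D.2 arithmetic starts from the D.1 subtotal, subtracts the committed-use saving, then adds back the sustained-use discount, because a CUD and SUD never stack on the same resource. Using the table's own numbers:

```python
baseline = 72.26        # D.1 subtotal, USD/month
cud_saving = 22.70      # 1-yr CUD saving on the n2 vCPU + RAM, per table D.2
sud_given_back = 7.10   # SUD no longer applies once the CUD covers the VM
adjusted = round(baseline - cud_saving + sud_given_back, 2)
print(f"adjusted: ${adjusted}/mo")
```

The net saving is therefore the CUD discount minus the SUD you forfeit, which is why a CUD only pays off on steady 24/7 workloads like nexus-vm.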

D.3 Baseline + Cloud SQL for WP (HA, db-custom-2-8192, 100 GB)

| Component | Monthly add |
| --- | --- |
| Cloud SQL HA, 2 vCPU + 8 GiB | ~$190 |
| Storage 100 GB SSD | ~$17 |
| Backups (auto) | ~$5 |
| Adjusted total | ~$284/mo |

D.4 Baseline + Global Application Load Balancer

| Component | Monthly add |
| --- | --- |
| Forwarding rule (1) | ~$18 (+ data processing) |
| Cloud Armor base + WAF rules | ~$5/policy + per-request |
| Egress through LB | $0.012/GB additional + standard egress |
| Adjusted total | ~$95-110/mo |

Appendix E. Glossary

ADC
Application Default Credentials, the credential discovery order Google client libraries follow (env var, gcloud user credentials, then the metadata server).
API
Application Programming Interface, here the HTTP/gRPC service endpoint Google exposes.
Artifact Registry
Google's package and container image repository, successor to Container Registry.
BQ
BigQuery, Google's serverless data warehouse.
CDN
Content Delivery Network, here Cloud CDN or Cloudflare.
CEL
Common Expression Language, used in IAM conditions and org policy custom constraints.
CIDR
Classless Inter-Domain Routing, an IP range like 10.10.0.0/24.
CMEK
Customer-Managed Encryption Key, key in your Cloud KMS used to encrypt a resource.
Cloud Armor
Google's WAF + DDoS protection for the Application Load Balancer.
Cloud Build
Hosted CI service.
Cloud Run
Serverless container service, scales 0 to N.
Cloud Shell
Browser-based Linux shell pre-loaded with gcloud.
CSEK
Customer-Supplied Encryption Key, raw key bytes per request, mostly deprecated.
CUD
Committed Use Discount, 1- or 3-year commitment for compute pricing.
DLP
Data Loss Prevention, now Sensitive Data Protection.
DR
Disaster Recovery, the practice of rebuilding after major failure.
Eventarc
Event router that bridges audit logs and Pub/Sub into Cloud Run.
GA
Generally Available, the highest stability level for a Google product.
GCE
Google Compute Engine, the IaaS VM service.
GCS
Google Cloud Storage, the object store.
GKE
Google Kubernetes Engine, managed K8s.
HA
High Availability, here a regional Cloud SQL configuration with synchronous standby.
HCL
HashiCorp Configuration Language, the syntax of Terraform.
HSM
Hardware Security Module, dedicated cryptographic hardware.
IAM
Identity and Access Management.
IAP
Identity-Aware Proxy, fronts VMs and apps with Google identity auth.
IaC
Infrastructure as Code, e.g. Terraform.
KMS
Key Management Service.
LB
Load Balancer.
MIG
Managed Instance Group, an autoscaled cluster of identical VMs.
MQL
Monitoring Query Language, advanced query syntax for Cloud Monitoring.
NCC
Network Connectivity Center, hub-and-spoke management for VPC and hybrid.
NIC
Network Interface Controller. Also Network Intelligence Center.
OS Login
SSH access tied to Google identity, IAM-controlled.
OWASP
Open Worldwide Application Security Project, source of common rule sets.
PD
Persistent Disk, the older block storage family. Hyperdisk is the new family.
PGA
Private Google Access, lets a private VM reach googleapis.com via Google's backbone.
PITR
Point-In-Time Recovery, restore to any second within retention window.
PSC
Private Service Connect, attaches a managed service at a private IP inside your VPC.
RAG
Retrieval-Augmented Generation, here the Zeus index of 19,679 chunks.
RPO
Recovery Point Objective, the maximum data loss tolerated.
RTO
Recovery Time Objective, the maximum downtime tolerated.
SA
Service Account, a Google identity for software.
SCC
Security Command Center, GCP's posture and findings dashboard.
SLI
Service Level Indicator, the metric.
SLO
Service Level Objective, the target.
SSO
Single Sign-On.
SSRF
Server-Side Request Forgery, where a server is tricked into fetching attacker-chosen URLs.
SUD
Sustained Use Discount, automatic discount for monthly compute usage.
TF
Terraform.
UBLA
Uniform Bucket-Level Access, IAM-only access control on a GCS bucket.
VPC
Virtual Private Cloud, the global software-defined network.
VPC-SC
VPC Service Controls, a security perimeter around managed services.
WAF
Web Application Firewall.
WIF
Workload Identity Federation, keyless auth from outside-GCP workloads.

Appendix F. Quick reference card

Project: bsp-prod · Region: us-central1 · Zone: us-central1-a · VM: nexus-vm · IP: 34.55.179.122

SSH today: ssh -i ~/.ssh/google_compute_engine dovew@34.55.179.122 · SSH bulletproof: gcloud compute ssh nexus-vm --zone=us-central1-a --tunnel-through-iap

Daily ops loop: systemctl status nexus.service context-harness.service · journalctl -u nexus.service -n 100 · df -h · free -m

Backup verification: gcloud storage ls gs://bsp-nexus-backups/daily/ | tail -3 · gcloud compute snapshots list --filter="name~nexus-daily" --sort-by=~creationTimestamp --limit=3

Read a secret: gcloud secrets versions access latest --secret=NAME

Tail logs: gcloud logging read 'resource.type="gce_instance"' --limit=50 --order=desc --format="value(timestamp,severity,jsonPayload.message)"

Cost dashboard: Console → Billing → Reports, group by SKU · Status: status.cloud.google.com

Incident first 5 minutes: (1) confirm symptom, (2) check status page, (3) gcloud compute instances describe nexus-vm, (4) Logs Explorer severity>=ERROR, (5) Connectivity Test from Cloudflare CIDR.

DR command: see Section 15.12 fast rebuild script.

Bulletproof rule: never the fast option, always best practice. Read first, build second. Receipts not narration.