Reference

Part of the self-contained SRE guide

This is Part D — the standalone reference for the SRE. The Inlined contracts table below is the self-contained restatement of every load-bearing contract this guide depends on; each cell names its home PRD section as provenance. If they ever disagree, the PRD wins. When a PRD contract moves, update this guide too, guided by the sync checklist.

Open items & stubs

| Item | State | What's undecided |
| --- | --- | --- |
| Preview-env L7 metrics | Stub | Previews already get logs and CPU/memory. Open question: how to handle per-route latency/RPS (short retention, pre-aggregation, or suppression). Competitors (Railway, Render) expose preview observability, so full suppression is a known gap; decide the tradeoff when you architect previews. Previews are flagged is_ephemeral (structural, not name-matched), so suppression keys off the flag. |
| Distributed tracing | Stub | Push model: needs an OpenTelemetry SDK in customer code → OTel collector → trace store. That app-cooperation requirement is exactly why it is deferred (metrics and logs need zero app changes). Future plane. |
| Per-tier metric retention | Deferred | Needs downsampling or multiple VictoriaMetrics (VM) instances. MVP = global 90 d. |
| Custom-metric HPA (RPS/latency) | Deferred | Needs prometheus-adapter (or KEDA) exposing the regional VictoriaMetrics store through the custom-metrics API. MVP scales on CPU/memory via metrics-server only. |

FR / SC traceability

| Requirement | Covered in |
| --- | --- |
| FR-049: live log streaming | Ch. 3 Logs |
| FR-050: CPU/memory per service | Ch. 2 Metrics (cAdvisor path) |
| FR-051: tiered retention | Ch. 7 Retention |
| FR-063: L7 metrics (latency/RPS/error/throughput) | Ch. 2 Metrics (Envoy path) |
| FR-064: tenant-key attribution | Ch. 2 Metrics; Diagram 3 |
| FR-065: server-side query isolation | Ch. 6 Tenant isolation |
| FR-066: private transport + per-cluster auth | Ch. 4 Transport |
| FR-067: Envoy cardinality discipline | Ch. 2 Metrics (gotchas) |
| FR-068: metrics-server at bootstrap | Ch. 8 Bootstrap |
| FR-069: platform self-monitoring (Grafana Cloud) | Ch. 5 SRE plane |
| FR-070: environment-name validation (RFC 1123 ≤30) | Inlined contracts (below) |
| SC-016: no cross-tenant data | Ch. 6 (isolation test) |
| SC-017: alert latency < 2 min | Ch. 5 SRE plane |
| SC-018: metric freshness ~60s; billing-staleness alert | Ch. 9 Dashboards; Ch. 5 |

Inlined contracts & sync checklist

Note

This guide is self-contained — you do not need the PRD open to implement from it. The load-bearing contracts below are restated here; the PRD remains canonical (each cell names its home section as provenance — if they ever disagree, the PRD wins). When the PRD changes one of these, update this doc too, guided by the sync checklist.

| Contract | The rule (inlined) | Owner · PRD |
| --- | --- | --- |
| Tenant identity key | project_id · environment · service_id is the attribution key on every metric and log. workspace_id rides along as the billing-boundary label and is not part of the tenant key. There is no customer_id. | §24.1 · FR-064 |
| HTTPRoute name + parse | Route name = <project_uuid><service_uuid>-<environment>, each UUID hyphen-stripped to 32 hex chars. Parse positionally: chars[0:32] = project, [32:64] = service, and the segment after the separating hyphen = environment. Hyphens inside the env name are fine because the first 64 chars are fixed-width. This is the only customer identity on Envoy per-route metrics. | §20.2 |
| Environment-name rule | Customer-chosen, validated as an RFC 1123 label: lowercase [a-z0-9-], start/end alphanumeric, ≤30 chars. No fixed dev/staging/prod enum. Preview/ephemeral envs carry an is_ephemeral flag (structural, not name-matched); that flag is what suppresses per-route preview L7 metrics. | §20.2 · §24.1 · FR-070 |
| Required labels | Every Shuttle-created resource carries these starform.io/ labels: managed-by=shuttle, workspace-id, project-id, environment, service-id, service-name, cluster-id (plus var-group-id on Secrets and tier on pods). kube-state-metrics (KSM) must be told to expose them: --metric-labels-allowlist=pods=[starform.io/...]. | §24 · §24B |
| Query isolation | The API injects the tenant filter server-side from the authenticated session: {project_id="X",environment="…"} into PromQL, WHERE project_id='X' into ClickHouse. The client can never supply or override it. The filter also applies on the cross-region read path, which runs over private VPC peering through an authed front-door (vmauth + read-only ClickHouse user), never a public endpoint. | FR-065 · FR-071 |
| Retention | Logs: per-tier TTL by ClickHouse partition. Metrics: global 90 d on single-node VictoriaMetrics at MVP (per-tier deferred). | FR-051 |
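
The HTTPRoute parse and the environment-name rule above can be sketched together in Python. This is a minimal illustration, not the guide's implementation: the function and type names are hypothetical, the UUID hex is assumed to be lowercase, and the input is assumed to be the bare route name (extracting it from Envoy's stat output is version-sensitive and out of scope here).

```python
import re
from typing import NamedTuple

# RFC 1123 label capped at 30 chars: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric.
ENV_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,28}[a-z0-9])?")
# Hyphen-stripped UUID. Lowercase hex is an assumption of this sketch.
HEX32_RE = re.compile(r"[0-9a-f]{32}")


class RouteIdentity(NamedTuple):
    project_id: str   # 32 hex chars
    service_id: str   # 32 hex chars
    environment: str  # customer-chosen, may contain hyphens


def is_valid_environment(name: str) -> bool:
    """Check the environment-name rule (RFC 1123 label, at most 30 chars)."""
    return ENV_NAME_RE.fullmatch(name) is not None


def parse_route_name(route: str) -> RouteIdentity:
    """Split a route name positionally; never split on hyphens."""
    project, service = route[:32], route[32:64]
    separator, environment = route[64:65], route[65:]
    if not (HEX32_RE.fullmatch(project) and HEX32_RE.fullmatch(service)):
        raise ValueError(f"route {route!r}: first 64 chars are not two 32-hex IDs")
    if separator != "-":
        raise ValueError(f"route {route!r}: expected '-' at index 64")
    if not is_valid_environment(environment):
        raise ValueError(f"route {route!r}: invalid environment {environment!r}")
    return RouteIdentity(project, service, environment)
```

Because the first 64 characters are fixed-width, an environment such as feature-x survives intact; splitting on the first or last hyphen would not be safe.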

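The query-isolation contract can be illustrated with a small Python sketch that derives both filters from the session alone. Everything named here is an assumption for illustration: the Session shape, the function names, and the %(name)s placeholder style (which depends on the ClickHouse client in use). The sketch only builds the filters; enforcing them on an arbitrary customer query is the job of the query layer and is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Session:
    """Hypothetical authenticated-session shape; field names are assumptions."""
    project_id: str
    environment: str


def _escape_label_value(value: str) -> str:
    # Escape backslashes and double quotes so a value cannot close the matcher early.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def promql_tenant_selector(session: Session) -> str:
    """Build the label selector from session data only, never from client input."""
    return '{project_id="%s",environment="%s"}' % (
        _escape_label_value(session.project_id),
        _escape_label_value(session.environment),
    )


def clickhouse_tenant_clause(session: Session) -> Tuple[str, Dict[str, str]]:
    """Return a WHERE fragment plus bound parameters (no string interpolation)."""
    return "project_id = %(project_id)s", {"project_id": session.project_id}
```

Neither function accepts anything from the request body, which mirrors the rule that the client can never supply or override the filter.
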
Sync checklist — re-verify when the PRD moves

  • ① HTTPRoute parse (§20.2): highest risk. The parse is Envoy Gateway (EG) version-sensitive. On any EG bump, re-snapshot /stats/prometheus, re-verify the regex, and update both docs. This is the item most likely to silently break attribution.
  • ② Label catalog (§24 / §24B). If a required starform.io/* label is added or renamed, update the KSM allowlist and the cAdvisor join.
  • ③ Tenant key (§24.1). Any change to the identity tuple ripples through relabeling, recording rules, and the query filter.
  • ④ Topology & transport (§4.1, §35.4). Control-plane region / per-region VictoriaMetrics stores / Grafana Cloud: if the topology shifts, Diagrams 1/2 and chapters 1/4/5/8 follow.
  • ⑤ FR/SC (§40). FR-049/050/051/063–070, SC-016–018 — keep the traceability table above in step.
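
For checklist item ②, one way to keep the KSM allowlist in step with the label catalog is to generate the flag from a single list. The Python sketch below assumes the label keys are the starform.io/ prefix plus the names listed in the Required labels contract; the function name and the choice of which resources to allowlist are hypothetical (the contract itself only shows pods).

```python
LABEL_PREFIX = "starform.io/"

# Labels carried by every Shuttle-created resource, per the Required labels contract.
COMMON_LABELS = [
    "managed-by", "workspace-id", "project-id", "environment",
    "service-id", "service-name", "cluster-id",
]
# Resource-specific additions from the same contract.
EXTRA_LABELS = {"pods": ["tier"], "secrets": ["var-group-id"]}


def ksm_allowlist_flag(resources):
    """Render --metric-labels-allowlist for the given kube-state-metrics resources."""
    parts = []
    for resource in resources:
        names = COMMON_LABELS + EXTRA_LABELS.get(resource, [])
        labels = ",".join(LABEL_PREFIX + name for name in names)
        parts.append(f"{resource}=[{labels}]")
    return "--metric-labels-allowlist=" + ",".join(parts)


if __name__ == "__main__":
    print(ksm_allowlist_flag(["pods"]))
```

Adding or renaming a label then means editing one list and regenerating the flag, rather than hand-editing the KSM arguments.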

Starform · Observability Architecture · companion to Master PRD · this document is the "how to build it" layer; the PRD remains the source of truth for contracts.