Reference
Part of the self-contained SRE guide
This is Part D, the standalone reference for the SRE. The Inlined contracts table below restates every load-bearing contract this guide depends on, and each row names its home PRD section as provenance. If this guide and the PRD ever disagree, the PRD wins. When a PRD contract moves, update this guide too, guided by the sync checklist.
Open items & stubs
| Item | State | What's undecided |
|---|---|---|
| Preview-env L7 metrics | Stub | Previews already get logs + CPU/mem. Open: per-route latency/RPS handling — short retention, pre-aggregation, or suppression. Competitors (Railway, Render) expose preview observability, so full suppression is a known gap; decide the tradeoff when you architect previews. Previews are flagged is_ephemeral (structural, not name-matched), so suppression keys off the flag. |
| Distributed tracing | Stub | Push model: needs an OpenTelemetry SDK in customer code → OTel collector → trace store. The app-cooperation requirement is exactly why it's deferred (metrics/logs need zero app changes). Future plane. |
| Per-tier metric retention | Deferred | Needs downsampling or multiple VictoriaMetrics (VM) instances. MVP = global 90d. |
| Custom-metric HPA (RPS/latency) | Deferred | Needs prometheus-adapter (or KEDA) exposing the regional VM through the custom-metrics API. MVP scales on CPU/mem via metrics-server only. |
FR / SC traceability
| Requirement | Covered in |
|---|---|
| FR-049 — live log streaming | Ch. 3 Logs |
| FR-050 — CPU/memory per service | Ch. 2 Metrics (cAdvisor path) |
| FR-051 — tiered retention | Ch. 7 Retention |
| FR-063 — L7 metrics (latency/RPS/error/throughput) | Ch. 2 Metrics (Envoy path) |
| FR-064 — tenant-key attribution | Ch. 2 Metrics; Diagram 3 |
| FR-065 — server-side query isolation | Ch. 6 Tenant isolation |
| FR-066 — private transport + per-cluster auth | Ch. 4 Transport |
| FR-067 — Envoy cardinality discipline | Ch. 2 Metrics (gotchas) |
| FR-068 — metrics-server at bootstrap | Ch. 8 Bootstrap |
| FR-069 — platform self-monitoring (Grafana Cloud) | Ch. 5 SRE plane |
| FR-070 — environment-name validation (RFC 1123 ≤30) | Inlined contracts (below) |
| SC-016 — no cross-tenant data | Ch. 6 (isolation test) |
| SC-017 — alert latency < 2 min | Ch. 5 SRE plane |
| SC-018 — metric freshness ~60s; billing-staleness alert | Ch. 9 Dashboards; Ch. 5 |
Inlined contracts & sync checklist
Note
This guide is self-contained: you do not need the PRD open to implement from it. The load-bearing contracts are restated below. The PRD remains canonical; each row names its home section as provenance, and if the two ever disagree, the PRD wins. When the PRD changes one of these contracts, update this doc too, guided by the sync checklist.
| Contract | The rule (inlined) | Owner · PRD |
|---|---|---|
| Tenant identity key | project_id · environment · service_id is the attribution key on every metric and log. workspace_id rides along as the billing-boundary label, not part of the tenant key. There is no customer_id. | §24.1 · FR-064 |
| HTTPRoute name + parse | Route name = <project_uuid><service_uuid>-<environment>, each UUID hyphen-stripped to 32 hex chars. Parse positionally: chars[0:32] = project, [32:64] = service, the segment after the separating hyphen = environment (hyphens inside the env name are fine because the first 64 chars are fixed-width). This is the only customer identity on Envoy per-route metrics. | §20.2 |
| Environment-name rule | Customer-chosen, validated as an RFC 1123 label: lowercase [a-z0-9-], start/end alphanumeric, ≤30 chars. No fixed dev/staging/prod enum. Preview/ephemeral envs carry an is_ephemeral flag (structural, not name-matched); that flag is what suppresses per-route preview L7 metrics. | §20.2 · §24.1 · FR-070 |
| Required labels | Every Shuttle-created resource carries starform.io/: managed-by=shuttle, workspace-id, project-id, environment, service-id, service-name, cluster-id (+ var-group-id on Secrets, tier on pods). KSM must be told to expose them: --metric-labels-allowlist=pods=[starform.io/...]. | §24 · §24B |
| Query isolation | The API injects the tenant filter server-side from the authenticated session: {project_id="X",environment="…"} into PromQL, WHERE project_id='X' into ClickHouse. The client can never supply or override it. It applies on the cross-region read path too, which runs over private VPC peering through an authed front-door (vmauth + read-only ClickHouse user), never a public endpoint. | FR-065 · FR-071 |
| Retention | Logs: per-tier TTL by ClickHouse partition. Metrics: global 90 d on single-node VictoriaMetrics at MVP (per-tier deferred). | FR-051 |
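The Query isolation row can be made concrete with a small sketch. The guide does not say which mechanism performs the injection, so this is an assumption: metrics queries use VictoriaMetrics' `extra_filters[]` query arg to enforce the selector, and log queries use ClickHouse HTTP query parameters (`param_<name>` bound to `{name:Type}` placeholders). The function names, the `logs` table, its columns, and the allowed-argument list are hypothetical.

```python
from __future__ import annotations

# Query args a client may pass through; everything else is dropped.
# This allowlist is an assumption for illustration, not a PRD contract.
ALLOWED_CLIENT_ARGS = {"query", "time", "start", "end", "step"}


def _escape(value: str) -> str:
    """Escape a label value for use inside a double-quoted PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def tenant_selector(project_id: str, environment: str) -> str:
    """Build the tenant selector from session-derived values only."""
    return '{project_id="%s",environment="%s"}' % (
        _escape(project_id),
        _escape(environment),
    )


def build_metrics_args(
    client_args: dict[str, str], project_id: str, environment: str
) -> list[tuple[str, str]]:
    """Forward allowlisted client args, then append the server-side filter.

    Client-supplied extra_filters[] / extra_label are not in the allowlist,
    so a client cannot supply or override the tenant filter.
    """
    args = [(k, v) for k, v in client_args.items() if k in ALLOWED_CLIENT_ARGS]
    args.append(("extra_filters[]", tenant_selector(project_id, environment)))
    return args


def build_logs_query(project_id: str, environment: str) -> tuple[str, dict[str, str]]:
    """Return SQL with bound placeholders plus the HTTP params that fill them.

    Table and column names are hypothetical.
    """
    sql = (
        "SELECT timestamp, message FROM logs "
        "WHERE project_id = {project_id:String} "
        "AND environment = {environment:String} "
        "ORDER BY timestamp DESC LIMIT 100"
    )
    params = {"param_project_id": project_id, "param_environment": environment}
    return sql, params
```

Both helpers take the tenant values as explicit arguments so the caller has to read them from the authenticated session; nothing in the request body reaches the filter.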
Sync checklist: re-verify when the PRD moves
- ① HTTPRoute parse (§20.2): highest risk. Envoy-Gateway-version-sensitive. On any EG bump, re-snapshot /stats/prometheus, re-verify the regex, update both docs. The one most likely to silently break attribution.
- ② Label catalog (§24 / §24B). If a required starform.io/* label is added or renamed, update the KSM allowlist and the cAdvisor join.
- ③ Tenant key (§24.1). Any change to the identity tuple ripples through relabeling, recording rules, and the query filter.
- ④ Topology & transport (§4.1, §35.4). Control-plane region / per-region VM stores / Grafana Cloud — if the topology shifts, Diagrams 1/2 and chapters 1/4/5/8 follow.
- ⑤ FR/SC (§40). FR-049/050/051/063–070, SC-016–018 — keep the traceability table above in step.
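For item ①, a round-trip check on the route-name contract is one way to notice a broken parse early. The sketch below restates the contract from the Inlined contracts table (two hyphen-stripped UUIDs, a separating hyphen, then an RFC 1123 environment label of at most 30 chars) as a positional parser. It assumes the bare route name has already been extracted from the Envoy metric labels; that extraction is the version-sensitive regex and is not shown. All function and type names are hypothetical.

```python
import re
import uuid
from typing import NamedTuple

# RFC 1123 label, lowercase, alphanumeric at both ends, 1 to 30 chars.
ENV_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,28}[a-z0-9])?")
HEX32_RE = re.compile(r"[0-9a-f]{32}")


class RouteIdentity(NamedTuple):
    project_id: str   # 32 hex chars, hyphen-stripped UUID
    service_id: str   # 32 hex chars, hyphen-stripped UUID
    environment: str


def is_valid_environment(name: str) -> bool:
    """Apply the environment-name rule (FR-070)."""
    return ENV_NAME_RE.fullmatch(name) is not None


def build_route_name(project_uuid: str, service_uuid: str, environment: str) -> str:
    """Compose <project_uuid><service_uuid>-<environment> with stripped UUIDs."""
    if not is_valid_environment(environment):
        raise ValueError(f"invalid environment name: {environment!r}")
    # uuid.UUID(...).hex yields 32 lowercase hex chars without hyphens.
    return f"{uuid.UUID(project_uuid).hex}{uuid.UUID(service_uuid).hex}-{environment}"


def parse_route_name(route_name: str) -> RouteIdentity:
    """Parse positionally: the first 64 chars are fixed-width, so hyphens
    inside the environment name cannot confuse the split."""
    project_hex, service_hex = route_name[0:32], route_name[32:64]
    if not (HEX32_RE.fullmatch(project_hex) and HEX32_RE.fullmatch(service_hex)):
        raise ValueError("route name must start with two 32-char hex ids")
    if route_name[64:65] != "-":
        raise ValueError("missing separating hyphen at index 64")
    environment = route_name[65:]
    if not is_valid_environment(environment):
        raise ValueError(f"invalid environment segment: {environment!r}")
    return RouteIdentity(project_hex, service_hex, environment)
```

Run against a fresh /stats/prometheus snapshot after an EG bump, a parser like this fails loudly on an unexpected shape instead of silently misattributing traffic.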
Starform · Observability Architecture · companion to Master PRD · this document is the "how to build it" layer; the PRD remains the source of truth for contracts.