Operate
Part of the self-contained SRE guide
This is the operate layer — what breaks, how you know, and what you do. It is the part the PRD deliberately doesn't carry. The failure modes below reference contracts restated across this guide and collected in the Reference. The PRD wins on conflict.
Failure modes & runbooks
| Failure | Symptom | Response |
|---|---|---|
| kube-state-metrics down | CPU/mem/network lose tenant identity (the join has nothing to match); Envoy metrics unaffected. | Alert on KSM scrape health. CPU/mem dashboards degrade, not billing. Restore KSM; the join recovers automatically. |
| Envoy Gateway upgrade changes cluster-name format | Latency/RPS/error attribution breaks across the whole fleet at once. | Caught by the /stats/prometheus snapshot test before rollout. Update the relabel regex, re-snapshot. Never bump Envoy Gateway without re-running the test. |
| Join right-side not unique | Recording-rule eval fails ("multiple matches for labels"). | Find the duplicate kube_pod_labels series; ensure one row per (namespace,pod). |
| Vector aggregator / ClickHouse down | Logs stop arriving. | Agents disk-buffer and replay on recovery. Alert on aggregator health and ClickHouse insert latency. |
| VictoriaMetrics sink outage | Metrics gap. | vmagent's persistent queue buffers and replays. Alert on remote_write backlog growth. |
| Shuttle informer freezes | Billing meters off a stale pod view, silently. | Alert: time() - starform_informer_last_event_timestamp_seconds > 300. Restart Shuttle; cache rebuilds. |
| Preview-route churn | TSDB index bloat from create/destroy of full series sets. | Preview routes skip per-route L7 metrics today (a deferred decision); finalize short-retention or aggregation when the preview design lands. |
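
The /stats/prometheus snapshot test from the Envoy Gateway row can be sketched roughly as follows. This is a minimal illustration, not the guide's actual test: the `ROUTE_CLUSTER` pattern, the sample cluster names, and the helper names are all assumptions, and the real relabel regex lives in your scrape config. The idea is to pull every `envoy_cluster_name` value out of a saved stats dump and fail if any name no longer fits the pattern the relabel rule expects.

```python
import re

# ASSUMPTION: illustrative pattern only. Substitute the regex your relabel
# rule actually uses; this one is not taken from the guide.
ROUTE_CLUSTER = re.compile(
    r"^httproute/(?P<namespace>[^/]+)/(?P<route>[^/]+)/rule/(?P<rule>\d+)$"
)
CLUSTER_LABEL = re.compile(r'envoy_cluster_name="([^"]*)"')


def cluster_names(stats_text):
    """Collect the distinct envoy_cluster_name values in a stats dump."""
    names = set()
    for line in stats_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = CLUSTER_LABEL.search(line)
        if match:
            names.add(match.group(1))
    return names


def unmatched(stats_text, pattern=ROUTE_CLUSTER, ignore=frozenset()):
    """Return cluster names that the relabel pattern would fail to parse.

    `ignore` holds names that are expected not to match (hypothetical
    examples: internal or admin clusters that carry no tenant identity).
    """
    return sorted(
        name
        for name in cluster_names(stats_text)
        if name not in ignore and not pattern.match(name)
    )


# Hypothetical snapshot: the second series uses a changed name format.
SNAPSHOT = """# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{envoy_cluster_name="httproute/team-a/web/rule/0"} 42
envoy_cluster_upstream_rq_total{envoy_cluster_name="httproute_team-b_api_rule_0"} 7
"""

print(unmatched(SNAPSHOT))  # ['httproute_team-b_api_rule_0']
```

Run against a dump taken from the candidate Envoy Gateway version, a non-empty result is the signal to update the relabel regex and re-snapshot before rollout.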
Capacity & scale
Pods drive the numbers, not projects. At a 1,000-pod soft limit (up to ≈2,000 in edge cases, with a new cluster provisioned as usage nears 1,000):
- cAdvisor — ~10–15 kept series/pod → ~25–30k at 2,000 pods.
- Envoy — per route (a fraction of pods; only web services get one) × 2 gateway replicas → ~25–35k.
- Total ≈ 70–100k active series/cluster. Single-node VictoriaMetrics handles millions — you stay 1–2 orders of magnitude under.
- Gateway config plane is the real limit, not the store. Envoy Gateway has been benchmarked to ~2,000 routes with linear resource growth; the only knob is raising the xDS gRPC message size (4MB → 25MB). You sit below that at your route counts.
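
The arithmetic behind these estimates can be reproduced with a short sketch. The helper names are hypothetical, and reading "millions" as a 1M to 10M capacity range is an assumption made only to put a number on the headroom. The straight product of the per-pod cAdvisor figure and 2,000 pods gives 20–30k, so the ~25–30k quoted above sits in the upper part of that range.

```python
import math


def series_range(per_pod_low, per_pod_high, pods):
    """Return the (low, high) active-series estimate for a per-pod source."""
    return per_pod_low * pods, per_pod_high * pods


def headroom_orders(active_series, store_capacity):
    """Orders of magnitude between current load and store capacity."""
    return math.log10(store_capacity / active_series)


# cAdvisor: 10-15 kept series per pod at the 2,000-pod edge case.
print(series_range(10, 15, 2_000))  # (20000, 30000)

# ASSUMPTION: "millions" read as a 1M-10M capacity range for illustration.
# Worst case: highest load estimate against the lowest capacity.
print(round(headroom_orders(100_000, 1_000_000), 2))  # 1.0
# Best case: lowest load estimate against the highest capacity.
print(round(headroom_orders(70_000, 10_000_000), 2))  # 2.15
```

Swapping in measured per-pod series counts from your own cluster turns the same few lines into a quick check of the planning numbers.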
Note
Verify these against your own load before launch: they are planning estimates, not measurements. What actually bites at this scale is correctness, specifically the join's uniqueness and the Envoy-format coupling, not raw volume.
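
The join-uniqueness concern can be checked offline with a few lines. This is a sketch under assumptions: the input is a list of label sets exported for kube_pod_labels, the `instance` label and sample values are hypothetical, and `duplicate_join_keys` is an illustrative helper rather than part of the pipeline. Against a live store, a query along the lines of `count by (namespace, pod) (kube_pod_labels) > 1` should surface the same rows.

```python
from collections import Counter


def duplicate_join_keys(series):
    """Return (namespace, pod) keys that appear on more than one series.

    `series` is an iterable of label dicts, one per kube_pod_labels series.
    Any key returned here would make the join's right side non-unique.
    """
    counts = Counter((labels["namespace"], labels["pod"]) for labels in series)
    return sorted(key for key, seen in counts.items() if seen > 1)


# Hypothetical export: web-0 is reported twice, here by two scrape targets.
SERIES = [
    {"namespace": "team-a", "pod": "web-0", "instance": "ksm-0"},
    {"namespace": "team-a", "pod": "web-0", "instance": "ksm-1"},
    {"namespace": "team-a", "pod": "web-1", "instance": "ksm-0"},
    {"namespace": "team-b", "pod": "api-0", "instance": "ksm-0"},
]

print(duplicate_join_keys(SERIES))  # [('team-a', 'web-0')]
```

An empty result means there is one row per (namespace, pod), which is the condition the recording rule needs to evaluate.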
Dashboards (Stardeck)
How to build it
Stardeck queries flat, pre-computed series (starform:service:*) through the API's tenant
filter — never the raw join, never raw envoy_cluster_name. Surface the seven metrics per
service, with latency as percentiles (p50/p95/p99) and error rate as the headline 5xx ratio.
Target metric freshness within ~60s of emission (scrape + remote_write interval).
SC-018
Gotcha
The backend does no string parsing and runs no joins at query time — both happen upstream, before the data lands. If you find yourself parsing names in the API, the pipeline is misconfigured.
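
A minimal sketch of what "flat series through the tenant filter" could look like on the API side. Everything named here is an assumption for illustration: the `tenant` label, the `starform:service:error_ratio` metric name, and the `tenant_query` helper do not come from the PRD. The point is only that the API appends a tenant matcher to a pre-computed series and refuses anything else, so no name parsing or join happens at query time.

```python
import re

# ASSUMPTION: hypothetical label name for the tenant filter.
TENANT_LABEL = "tenant"

# Only flat, pre-computed series are queryable from the dashboard path.
_FLAT_SERIES = re.compile(r"^starform:service:[a-zA-Z0-9_:]+$")
# Conservative allow-list for tenant IDs so the value cannot break the selector.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.-]+$")


def tenant_query(metric, tenant_id):
    """Build a selector for one flat series, scoped to one tenant."""
    if not _FLAT_SERIES.match(metric):
        raise ValueError(f"not a pre-computed starform:service series: {metric!r}")
    if not _SAFE_VALUE.match(tenant_id):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    return f'{metric}{{{TENANT_LABEL}="{tenant_id}"}}'


# Hypothetical metric name, used only to show the output shape.
print(tenant_query("starform:service:error_ratio", "acme"))
# starform:service:error_ratio{tenant="acme"}
```

Rejecting raw metric names at this boundary makes a misconfigured pipeline fail loudly instead of tempting the API into parsing cluster names.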