Operate

Part of the self-contained SRE guide

This is the operate layer: what breaks, how you know, and what you do. It is the part the PRD deliberately doesn't carry. The failure modes below reference contracts that are restated across this guide and collected in the Reference. Where this guide and the PRD conflict, the PRD wins.

Failure modes & runbooks

Failure	Symptom	Response
kube-state-metrics down	CPU/mem/network lose tenant identity (the join has nothing to match); Envoy metrics unaffected.	Alert on KSM scrape health. CPU/mem dashboards degrade, not billing. Restore KSM; the join recovers automatically.
Envoy Gateway upgrade changes cluster-name format	Latency/RPS/error attribution breaks across the whole fleet at once.	Caught by the `/stats/prometheus` snapshot test before rollout. Update the relabel regex, then re-snapshot. Never bump Envoy Gateway without re-running the test.
Join right-side not unique	Recording-rule eval fails ("multiple matches for labels").	Find the duplicate `kube_pod_labels` series; ensure one row per `(namespace,pod)`.
Vector aggregator / ClickHouse down	Logs stop arriving.	Agents disk-buffer and replay on recovery. Alert on aggregator health and ClickHouse insert latency.
VictoriaMetrics sink outage	Metrics gap.	vmagent's persistent queue buffers and replays. Alert on `remote_write` backlog growth.
Shuttle informer freezes	Billing meters off a stale pod view, silently.	Alert: `time() - starform_informer_last_event_timestamp_seconds > 300`. Restart Shuttle; cache rebuilds.
Preview-route churn	TSDB index bloat from create/destroy of full series sets.	Preview routes skip per-route L7 metrics today (a deferred decision); finalize short-retention or aggregation when the preview design lands.
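
The snapshot test named in the Envoy Gateway row could look roughly like the sketch below. It assumes cluster names shaped like `httproute/<namespace>/<route>/rule/<n>` and a hypothetical relabel regex; the real pattern lives in your scrape config, and the sample lines are illustrative rather than captured from a gateway.

```python
import re

# ASSUMPTION: hypothetical relabel regex; substitute the one from your scrape config.
# Expects cluster names shaped like "httproute/<namespace>/<route>/rule/<n>".
CLUSTER_NAME_RE = re.compile(
    r"^httproute/(?P<namespace>[^/]+)/(?P<route>[^/]+)/rule/(?P<rule>\d+)$"
)

# Pulls the envoy_cluster_name label value out of a Prometheus exposition line.
LABEL_RE = re.compile(r'envoy_cluster_name="([^"]*)"')


def unmatched_cluster_names(snapshot: str) -> list[str]:
    """Return cluster names in a /stats/prometheus snapshot that the regex cannot parse."""
    names = set()
    for line in snapshot.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = LABEL_RE.search(line)
        if match:
            names.add(match.group(1))
    return sorted(n for n in names if not CLUSTER_NAME_RE.match(n))


# Illustrative snapshot lines, not real gateway output.
SNAPSHOT = """\
# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{envoy_cluster_name="httproute/tenant-a/web/rule/0"} 42
envoy_cluster_upstream_rq_total{envoy_cluster_name="httproute/tenant-b/api/rule/1"} 7
"""

if __name__ == "__main__":
    print("unparsed cluster names:", unmatched_cluster_names(SNAPSHOT))
```

Run against a snapshot taken from the candidate Envoy Gateway version, a non-empty result means the relabel regex needs updating before rollout. A real gateway also exposes internal clusters that are not routes, so a production version of this check would likely need an allowlist for those.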

Capacity & scale

Pods drive the numbers, not projects. Assume a 1,000-pod soft limit per cluster (≈2,000 in edge cases; a new cluster is provisioned as the count nears 1,000):

cAdvisor: ~10–15 kept series/pod → ~20–30k at 2,000 pods.
Envoy: one series set per route (a fraction of pods; only web services get one) × 2 gateway replicas → ~25–35k.
Total ≈ 70–100k active series/cluster. Single-node VictoriaMetrics handles millions, so you stay 1–2 orders of magnitude under.
The gateway config plane is the real limit, not the store. Envoy Gateway has been benchmarked to ~2,000 routes with linear resource growth; the only knob is raising the xDS gRPC message size (4 MB → 25 MB). You sit below that at your route counts.
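
The per-pod arithmetic and the headroom claim can be rechecked with a few lines. This is a rough sketch: the capacity bounds are an assumption that reads "millions" as 1M to 10M series, not a measured limit of your VictoriaMetrics node.

```python
import math


def series_range(pods: int, low_per_pod: int, high_per_pod: int) -> tuple[int, int]:
    """Scale a per-pod series range to a whole cluster."""
    return pods * low_per_pod, pods * high_per_pod


def headroom_orders(active_series: int, store_capacity: int) -> float:
    """Orders of magnitude between current load and an assumed store capacity."""
    return math.log10(store_capacity / active_series)


# ASSUMPTION: "millions" is read as 1M to 10M series; replace with a measured figure.
ASSUMED_CAPACITY_LOW = 1_000_000
ASSUMED_CAPACITY_HIGH = 10_000_000

if __name__ == "__main__":
    low, high = series_range(pods=2_000, low_per_pod=10, high_per_pod=15)
    print(f"cAdvisor at 2,000 pods: {low:,} to {high:,} series")
    # Worst case: highest total estimate against the lowest assumed capacity.
    worst = headroom_orders(100_000, ASSUMED_CAPACITY_LOW)
    # Best case: lowest total estimate against the highest assumed capacity.
    best = headroom_orders(70_000, ASSUMED_CAPACITY_HIGH)
    print(f"headroom: {worst:.1f} to {best:.1f} orders of magnitude")
```

Swap in your own pod counts and kept-series figures once you have a real scrape to measure.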

Note

Verify these against your own load before launch; they are planning estimates, not measurements. What actually bites at this scale is correctness, specifically the join's uniqueness and the Envoy-format coupling, not raw volume.
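
For the join's uniqueness, a check along these lines can find offending rows before the recording rule fails. It is a minimal sketch that assumes you have already fetched the `kube_pod_labels` label sets as dictionaries; the `label_tenant` and `uid` values are made up for illustration.

```python
from collections import Counter


def duplicate_join_keys(series: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Return every (namespace, pod) pair that appears on more than one series."""
    counts = Counter((s["namespace"], s["pod"]) for s in series)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative label sets; label_tenant and uid values are hypothetical.
SERIES = [
    {"namespace": "tenant-a", "pod": "web-0", "uid": "u1", "label_tenant": "a"},
    {"namespace": "tenant-a", "pod": "web-0", "uid": "u2", "label_tenant": "a"},
    {"namespace": "tenant-b", "pod": "api-0", "uid": "u3", "label_tenant": "b"},
]

if __name__ == "__main__":
    for namespace, pod in duplicate_join_keys(SERIES):
        print(f"duplicate join key: namespace={namespace} pod={pod}")
```

Any pair this returns is a right-side row that would trigger the "multiple matches for labels" error in the join.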

Dashboards (Stardeck)

How to build it

Stardeck queries flat, pre-computed series (`starform:service:*`) through the API's tenant filter, never the raw join and never raw `envoy_cluster_name`. Surface the seven metrics per service, with latency as percentiles (p50/p95/p99) and error rate as the headline 5xx ratio. Target metric freshness within ~60s of emission (scrape + remote_write interval). SC-018
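
One way the API side of that rule might be enforced is sketched below. The `tenant` and `service` label names and the `request_rate` metric suffix are hypothetical; only the `starform:service:` prefix comes from this guide.

```python
import re

FLAT_PREFIX = "starform:service:"

# Conservative allowlist so label values cannot break out of the selector.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.-]+$")


def build_selector(metric: str, tenant: str, service: str) -> str:
    """Build a selector for a flat series, always scoped to one tenant.

    ASSUMPTION: 'tenant' and 'service' are hypothetical label names.
    """
    if not metric.startswith(FLAT_PREFIX):
        raise ValueError(f"only {FLAT_PREFIX}* series may be queried: {metric!r}")
    for value in (tenant, service):
        if not _SAFE_VALUE.match(value):
            raise ValueError(f"unsafe label value: {value!r}")
    return f'{metric}{{tenant="{tenant}",service="{service}"}}'


if __name__ == "__main__":
    # Hypothetical metric suffix, shown only to illustrate the shape of the output.
    print(build_selector("starform:service:request_rate", "acme", "web"))
```

Rejecting anything outside the flat prefix keeps raw Envoy series out of dashboard queries by construction, and the function never parses a name, it only composes one.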

Gotcha

The backend does no string parsing and runs no joins at query time; both happen upstream, before the data lands. If you find yourself parsing names in the API, the pipeline is misconfigured.