Platform self-monitoring (the SRE plane)
Part of the self-contained SRE guide
This chapter restates the GDPR labeldrop rule (keep customer IDs out of the platform series) and
the Shuttle-informer staleness signal inline so the guide stands alone. Owned by PRD §35.5 /
§26.1 — the PRD wins on conflict. The Reference collects every inlined contract
with its provenance.
In plain words
A dedicated Grafana Alloy agent in each customer cluster scrapes the health of Starform's own pieces and ships it to Grafana Cloud (external SaaS). There, Grafana Alerting checks rules and OnCall pages the SRE. It runs as its own agent, separate from the customer pipeline, for two reasons: an external watcher survives the infra dying (which a plane co-located in your own cluster cannot), and an independent pipeline means a vmagent problem can't blind you to the platform's own health. MVP-only; revisit post-MVP.
How to build it
Two pieces: an agent that ships platform health out, and the rules that act on it.
1 · Deploy Grafana Alloy per cluster (Grafana's k8s-monitoring Helm chart). It scrapes
Starform-component series — Shuttle /metrics, Envoy gateway-self, node-exporter, KSM platform
objects — and remote_writes them to Grafana Cloud; vmagent separately ships only the customer
series to VictoriaMetrics. GDPR: drop customer identifiers before they leave the cluster (the
labeldrop below), or use an EU-residency stack:
// Alloy (River): keep only platform jobs, strip tenant labels (GDPR), remote_write to Grafana Cloud
discovery.kubernetes "pods" {
  role = "pod"
}
prometheus.scrape "platform" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.relabel.platform.receiver]
}
prometheus.relabel "platform" {
  forward_to = [prometheus.remote_write.grafana.receiver]
  // Keep only Starform-component jobs. Assumes each target's job label
  // already matches one of these names; everything else is dropped here.
  rule {
    source_labels = ["job"]
    regex         = "shuttle|node-exporter|kube-state-metrics|envoy-self"
    action        = "keep"
  }
  // Never ship customer IDs (GDPR).
  rule {
    regex  = "project_id|service_id|workspace_id"
    action = "labeldrop"
  }
}
prometheus.remote_write "grafana" {
  endpoint {
    url = "https://prometheus-prod-NN.grafana.net/api/prom/push"
    basic_auth {
      username = "<grafana-cloud-instance-id>"
      password = "<grafana-cloud-token>" // inject from a Secret, not inline
    }
  }
}
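The inline comment in the config says to inject the token from a Secret rather than writing it inline. A minimal sketch of such a Secret follows. The name `grafana-cloud-credentials`, the namespace `monitoring`, and the key names are assumptions for illustration, not values from the PRD. Alloy could then read it with its `remote.kubernetes.secret` component or from an environment variable, depending on how the Helm chart is set up.

```yaml
# Hypothetical Secret holding the Grafana Cloud remote_write credentials.
# Name, namespace, and key names are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-cloud-credentials
  namespace: monitoring
type: Opaque
stringData:
  username: "<grafana-cloud-instance-id>"
  password: "<grafana-cloud-token>"
```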
2 · Load the platform alert rules into Grafana Cloud (mimirtool) or author them in Grafana
Alerting; Grafana OnCall pages the SRE. Watch Starbase API/Worker, control-plane Postgres,
ClickHouse, VictoriaMetrics, Envoy health, DO Load Balancers, and Shuttle's /metrics:
groups:
  - name: starform-platform
    rules:
      - alert: ShuttleInformerStale # billing meters off a frozen view
        expr: time() - starform_informer_last_event_timestamp_seconds{resource="pods"} > 300
        for: 2m
        labels: { severity: critical }
        annotations: { summary: "Shuttle informer stale on {{ $labels.cluster_id }}" }
      - alert: StarbaseAPIDown
        expr: up{job="starbase-api"} == 0
        for: 1m
        labels: { severity: critical }
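Before loading the rules, they can be exercised offline with `promtool test rules`. The sketch below assumes the `starform-platform` group is saved as `platform-rules.yaml` (a hypothetical filename) and checks only `StarbaseAPIDown`. The `instance` label value is made up for the test.

```yaml
# platform-rules_test.yaml (hypothetical filename)
# Run with: promtool test rules platform-rules_test.yaml
rule_files:
  - platform-rules.yaml   # assumed location of the starform-platform group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up is 1 at 0m and 1m, then 0 from 2m onward
      - series: 'up{job="starbase-api", instance="api-0"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # Still healthy at 1m: no alert expected, so exp_alerts is omitted.
      - eval_time: 1m
        alertname: StarbaseAPIDown
      # Down since 2m and "for: 1m" is satisfied by 3m, so firing at 5m.
      - eval_time: 5m
        alertname: StarbaseAPIDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: starbase-api
              instance: api-0
```

The staleness rule is left out of this sketch because it depends on `time()`, which makes the input series more fiddly to construct.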
Gotchas & what lives elsewhere
- Watch the watcher's blind spot: alert on starform_informer_last_event_timestamp_seconds. If Shuttle's informer silently freezes, billing meters off a stale view with no reconcile error.
- Grafana Cloud is MVP-only: the post-MVP option is self-hosting platform monitoring once the SRE team can operate it (§39.3 #42). Either way, customer telemetry never goes to Grafana Cloud, only Starform-component series.
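One further blind spot may be worth covering: the `ShuttleInformerStale` expression returns nothing if the metric disappears entirely (for example, if Shuttle stops being scraped), so that alert stays silent. A possible companion rule, sketched here as an assumption rather than a PRD-owned contract, uses PromQL's `absent()`. The group name, alert name, and `for` duration are illustrative.

```yaml
groups:
  - name: starform-platform-absent   # hypothetical group name
    rules:
      - alert: ShuttleInformerMetricAbsent   # hypothetical alert name
        # absent() returns a value only when no matching series exists at all
        expr: absent(starform_informer_last_event_timestamp_seconds{resource="pods"})
        for: 5m
        labels: { severity: critical }
        annotations: { summary: "Shuttle informer timestamp metric is missing" }
```

As written, this fires only when the series is missing from every cluster that ships to the same Grafana Cloud stack. A per-cluster variant would need one selector per `cluster_id`.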
PRD reference & inlined contracts
Owned by §35.5 (platform self-monitoring), §26.1 (Shuttle self-obs); FR-069, SC-017. The GDPR
labeldrop rule and the informer-staleness signal are restated above so this guide stands
alone — if they ever diverge, the PRD wins. Canonical map:
Canonical Sources.