
Platform self-monitoring (the SRE plane)

Part of the self-contained SRE guide

This chapter restates the GDPR labeldrop rule (keep customer IDs out of the platform series) and the Shuttle-informer staleness signal inline so the guide stands alone. Owned by PRD §35.5 / §26.1 — the PRD wins on conflict. The Reference collects every inlined contract with its provenance.

In plain words

A dedicated Grafana Alloy agent in each customer cluster scrapes the health of Starform's own components and ships it to Grafana Cloud (external SaaS). There, Grafana Alerting evaluates the rules and OnCall pages the SRE. The agent runs separately from the customer pipeline for two reasons: an external watcher keeps working when the cluster's infrastructure dies (a monitoring plane co-located in that cluster cannot), and an independent pipeline means a vmagent problem can't blind you to the platform's own health. MVP-only; revisit post-MVP.

How to build it

Two pieces: an agent that ships platform health out, and the rules that act on it.

1 · Deploy Grafana Alloy per cluster (Grafana's k8s-monitoring Helm chart). It scrapes Starform-component series — Shuttle /metrics, Envoy gateway-self, node-exporter, KSM platform objects — and remote_writes them to Grafana Cloud; vmagent separately ships only the customer series to VictoriaMetrics. GDPR: drop customer identifiers before they leave the cluster (the labeldrop below), or use an EU-residency stack:

Grafana Alloy · config.alloy (River)
// Alloy (River): keep only platform jobs, strip tenant labels (GDPR), remote_write to Grafana Cloud
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "platform" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.relabel.platform.receiver]
}

prometheus.relabel "platform" {
  forward_to = [prometheus.remote_write.grafana.receiver]

  // keep only platform jobs (assumes targets already carry these job labels)
  rule {
    source_labels = ["job"]
    regex         = "shuttle|node-exporter|kube-state-metrics|envoy-self"
    action        = "keep"
  }

  // never ship customer IDs (GDPR)
  rule {
    regex  = "project_id|service_id|workspace_id"
    action = "labeldrop"
  }
}

prometheus.remote_write "grafana" {
  endpoint {
    url = "https://prometheus-prod-NN.grafana.net/api/prom/push"
    basic_auth {
      username = "<grafana-cloud-instance-id>"
      password = "<grafana-cloud-token>"   // inject from a Secret, not inline
    }
  }
}
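
To sanity-check what the two relabel rules do to a single series, the logic can be mimicked outside Alloy. The following is a minimal Python sketch, not Alloy's implementation: the sample series and metric names are hypothetical, and only the two regexes are taken from the config above. It relies on Prometheus-style relabel regexes being fully anchored, which is why it uses fullmatch.

```python
import re
from typing import Dict, Optional

# Regexes copied from the relabel rules above.
KEEP_JOBS = re.compile(r"shuttle|node-exporter|kube-state-metrics|envoy-self")
TENANT_LABELS = re.compile(r"project_id|service_id|workspace_id")


def relabel(labels: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply the keep rule, then the labeldrop rule, to one series' labels.

    Returns None when the series is dropped entirely.
    """
    # keep: the whole job value must match (relabel regexes are anchored)
    if not KEEP_JOBS.fullmatch(labels.get("job", "")):
        return None
    # labeldrop: remove every label whose *name* matches the tenant regex
    return {k: v for k, v in labels.items() if not TENANT_LABELS.fullmatch(k)}


# Hypothetical sample series, for illustration only.
platform = {"__name__": "shuttle_reconcile_total", "job": "shuttle", "project_id": "p-123"}
customer = {"__name__": "http_requests_total", "job": "customer-app", "project_id": "p-123"}

print(relabel(platform))  # {'__name__': 'shuttle_reconcile_total', 'job': 'shuttle'}
print(relabel(customer))  # None
```

One consequence worth checking in a real cluster: if two platform series differ only in a dropped tenant label, they become identical after labeldrop and will collide at remote_write.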

2 · Load the platform alert rules into Grafana Cloud (mimirtool) or author them in Grafana Alerting; Grafana OnCall pages the SRE. Watch Starbase API/Worker, control-plane Postgres, ClickHouse, VictoriaMetrics, Envoy health, DO Load Balancers, and Shuttle's /metrics:

Grafana alert rules · mimirtool
groups:
  - name: starform-platform
    rules:
      - alert: ShuttleInformerStale       # billing meters off a frozen view
        expr: time() - starform_informer_last_event_timestamp_seconds{resource="pods"} > 300
        for: 2m
        labels:      { severity: critical }
        annotations: { summary: "Shuttle informer stale on {{ $labels.cluster_id }}" }
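      - alert: StarbaseAPIDown
        # only evaluates if up{job="starbase-api"} is actually shipped; the sample Alloy keep regex does not list that job
        expr: up{job="starbase-api"} == 0
        for: 1m
        labels:      { severity: critical }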
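
To reason about how long a frozen informer goes unnoticed, the ShuttleInformerStale rule can be walked through by hand. The sketch below is a simplified Python model of one rule evaluation (threshold 300 s, `for: 2m`), not the ruler's actual code. The 60 s evaluation interval is an assumption, since the rule group above does not set one.

```python
from typing import Optional, Tuple

STALE_AFTER_S = 300  # threshold from the rule's expr
FOR_S = 120          # the rule's `for: 2m`


def alert_state(now: float, last_event_ts: float,
                pending_since: Optional[float]) -> Tuple[str, Optional[float]]:
    """One evaluation of: time() - last_event_timestamp > 300, for 2m."""
    if now - last_event_ts <= STALE_AFTER_S:
        return "inactive", None          # condition false: reset pending timer
    if pending_since is None:
        pending_since = now              # condition first became true
    state = "firing" if now - pending_since >= FOR_S else "pending"
    return state, pending_since


def first_firing_time(last_event_ts: float, eval_interval_s: int = 60,
                      horizon_s: int = 1200) -> Optional[float]:
    """Step through evaluations until the alert fires (assumed 60 s interval)."""
    pending_since = None
    t = last_event_ts
    while t <= last_event_ts + horizon_s:
        state, pending_since = alert_state(t, last_event_ts, pending_since)
        if state == "firing":
            return t
        t += eval_interval_s
    return None


print(first_firing_time(0))  # 480
```

Under these assumptions the page goes out about 8 minutes after the informer's last event: the expression first holds at the 360 s evaluation, and the alert has been pending for 2 minutes at 480 s.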

Gotchas & what lives elsewhere

  • Watch the watcher's blind spot: alert on starform_informer_last_event_timestamp_seconds — if Shuttle's informer silently freezes, billing meters off a stale view with no reconcile error.
  • Grafana Cloud is MVP-only — the post-MVP option is self-hosting platform monitoring once the SRE team can operate it (§39.3 #42). Either way, customer telemetry never goes to Grafana Cloud — only Starform-component series.

PRD reference & inlined contracts

Owned by §35.5 (platform self-monitoring), §26.1 (Shuttle self-obs); FR-069, SC-017. The GDPR labeldrop rule and the informer-staleness signal are restated above so this guide stands alone — if they ever diverge, the PRD wins. Canonical map: Canonical Sources.