Observability
How Starform collects, attributes, ships, stores, and serves customer telemetry — and how the platform watches itself. This is the self-contained build-and-operate guide for the engineer standing the pipeline up (§35).
Self-contained
You do not need anything else open to implement from this tab. The load-bearing contracts it
depends on (the HTTPRoute route-name format, the starform.io/* label set, the tenant key, the
server-side query filter) are restated inline in each chapter and collected in the
Reference with provenance tags. Those contracts are owned by §20.2 / §24 / §35, and those
pages win on conflict. When a contract moves, update this page too, guided by the
sync checklist.
The guide has four parts. Part A (this page) is the whole system in plain words. Part B (chapters 1–9) explains how to build each piece. Part C (Operate) is what to do when a piece breaks. Part D (Reference) carries FR/SC traceability and the inlined contracts. Scope: metrics and logs; tracing and preview-env L7 metrics are stubs.
Read this first: in plain words
If you read nothing else, read this part. It is the whole system with no jargon.
The components: what each piece is
Before the data-flow, here's the cast. Everything obeys one rule — Starbase decides, Shuttle applies; every change flows through desired state. The pieces are grouped by where they run.
The component map. Starform's pieces, grouped by where they run. Brand-blue = Starform-built · gray = third-party · darker = a data store · dashed = external SaaS. The telemetry path threads bottom-up — cluster collectors (Fluent Bit, vmagent) → regional stores → Stardeck.
The mental model
Monitoring is two separate pipes — one for metrics, one for logs — that never touch until they meet on the dashboard. Alongside them runs a third plane (Grafana Cloud, external) that watches Starform's own machinery for the SRE team.
Nobody "queries a new pod." Metrics are pulled on a timer, logs are tailed off the machine, and traces (later) are pushed by the app. Three signals, three habits.
Diagram 1 — VPC topology. The regional VPC (one per region) holds the customer cluster and its telemetry VM stores, so ingest is intra-VPC at $0. The central control plane (its own VPC) peers in to read each region's stores — only through the authed front-door, in two parts: vmauth (a proxy deployment) for VictoriaMetrics, and a read-only ClickHouse user (native RBAC, not a service) for logs. The stores stay private; Starbase injects the tenant filter. One VPC + one peering per region (~12 ≪ 50). A dedicated Grafana Alloy agent ships platform series to Grafana Cloud (external) over the internet. Arrow colour: blue = metrics pipe · amber = logs pipe. At MVP the control plane is single-region (co-located), so there is no cross-region read yet.
The lifecycle in one breath
A pod is born → on the next timer tick its meters get read and its log file gets tailed (no one
contacts the pod) → each signal is tagged with project · environment · service → shipped over the
private network → stored (metrics in VictoriaMetrics, logs in ClickHouse) → the dashboard shows it
through a filter the user can't remove. A pod dies → readings stop, series age out by retention, the
log file disappears. Nothing to clean up.
| # | Step | What happens |
|---|---|---|
| 01 | Born | Shuttle creates the pod, its labels, and its HTTPRoute. |
| 02 | Read & tailed | Next tick: vmagent scrapes kubelet + Envoy; Fluent Bit picks up the new log file. |
| 03 | Tagged | Identity attached — by join, by parse, or by label. |
| 04 | Shipped | Over the regional VPC (intra-VPC, $0) with a per-cluster token. |
| 05 | Stored | Metrics → VictoriaMetrics. Logs → ClickHouse. |
| 06 | Served | API forces a tenant filter; Stardeck renders it. |
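Steps 03 and 06 are the two that carry tenant isolation, so a toy version may help make them concrete. The sketch below is an in-memory illustration only, not Starform's implementation: the `Identity` class, the `tag` and `serve` functions, the plain field names `project` / `environment` / `service`, and the choice of `project` as the tenant key are all assumptions made for the example. The real label set and tenant key are defined in the Reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """The three tags every signal carries (field names are illustrative)."""
    project: str
    environment: str
    service: str


def tag(record: dict, identity: Identity) -> dict:
    """Step 03: attach identity to a raw record without mutating the input."""
    return {
        **record,
        "project": identity.project,
        "environment": identity.environment,
        "service": identity.service,
    }


def serve(store: list[dict], caller_project: str, user_filter: dict) -> list[dict]:
    """Step 06: apply the caller's filter, then force the tenant filter on top.

    The tenant key is written last, so a user-supplied "project" value
    cannot widen or replace it. Using "project" as the tenant key is an
    assumption of this sketch.
    """
    effective = {**user_filter, "project": caller_project}
    return [
        r for r in store
        if all(r.get(k) == v for k, v in effective.items())
    ]


# Two tenants write into the same (toy) store.
store = [
    tag({"msg": "boot"}, Identity("acme", "prod", "api")),
    tag({"msg": "boot"}, Identity("globex", "prod", "api")),
]

# A caller from "acme" asks for "globex" data; the forced filter wins.
rows = serve(store, caller_project="acme", user_filter={"project": "globex"})
```

The design point is ordering: the server merges the tenant constraint after the user's filter, so the user can narrow the result but cannot remove or override the tenant scope.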
What's in, and what's a stub
| Area | Status | Notes |
|---|---|---|
| Customer metrics — latency, RPS, throughput, error rate, CPU, memory, network | In MVP | Seven signals from two sources (Diagram 3). |
| Customer logs — runtime + build, live tail | In MVP | Fluent Bit → Vector aggregator → ClickHouse. |
| Platform self-monitoring — alerts on Starform's own components | In MVP | Grafana Cloud (external), fed by a dedicated Grafana Alloy agent. |
| Preview-env per-route L7 metrics | Stub | Previews still get logs + CPU/mem. How to handle per-route latency/RPS (short retention vs aggregate vs suppress) is still to be designed; see Open items. |
| Distributed tracing | Stub | Push model, needs app instrumentation. Future plane — sketched, not built. |
| Per-tier metric retention | Deferred | MVP = global 90d. Logs already tier cleanly. |
Next: Chapter 1 · Architecture & components — namespaces, the HTTPRoute name, and the label set, with the three identity conventions everything below depends on.