Observability

How Starform collects, attributes, ships, stores, and serves customer telemetry — and how the platform watches itself. This is the self-contained build-and-operate guide for the engineer standing the pipeline up (§35).

Self-contained

You do not need anything else open to implement from this tab. The load-bearing contracts it depends on (the HTTPRoute route-name format, the starform.io/* label set, the tenant key, the server-side query filter) are restated inline in each chapter and collected in the Reference with provenance tags. Those contracts are owned by §20.2, §24, and §35; those pages win on conflict. When a contract moves, update this page too, following the sync checklist.

The guide has four parts. Part A (this page) is the whole system in plain words. Part B (chapters 1–9) explains how to build each piece. Part C (Operate) is what to do when a piece breaks. Part D (Reference) carries FR/SC traceability and the inlined contracts. Scope: metrics and logs; tracing and preview-env per-route L7 metrics are stubs.

Read this first: in plain words

If you read nothing else, read this part. It is the whole system with no jargon.

The components: what each piece is

Before the data-flow, here's the cast. Everything obeys one rule — Starbase decides, Shuttle applies; every change flows through desired state. The pieces are grouped by where they run.

The component map. Starform's pieces, grouped by where they run. Brand-blue = Starform-built · gray = third-party · darker = a data store · dashed = external SaaS. The telemetry path threads bottom-up — cluster collectors (Fluent Bit, vmagent) → regional stores → Stardeck.

The mental model

Monitoring is two separate pipes — one for metrics, one for logs — that never touch until they meet on the dashboard. Alongside them runs a third plane (Grafana Cloud, external) that watches Starform's own machinery for the SRE team.

Nobody "queries a new pod." Metrics are pulled on a timer, logs are tailed off the machine, and traces (later) are pushed by the app. Three signals, three habits.

Diagram 1 — VPC topology. The regional VPC (one per region) holds the customer cluster and its telemetry VM stores, so ingest is intra-VPC at $0. The central control plane (its own VPC) peers in to read each region's stores — only through the authed front-door, in two parts: vmauth (a proxy deployment) for VictoriaMetrics, and a read-only ClickHouse user (native RBAC, not a service) for logs. The stores stay private; Starbase injects the tenant filter. One VPC + one peering per region (~12 ≪ 50). A dedicated Grafana Alloy agent ships platform series to Grafana Cloud (external) over the internet. Arrow colour: blue = metrics pipe · amber = logs pipe. At MVP the control plane is single-region (co-located), so there is no cross-region read yet.
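The "Starbase injects the tenant filter" step can be illustrated with a minimal sketch. It assumes a single tenant label named `project` and a Prometheus-style label selector as the query shape; both are placeholders for illustration, since the real tenant key and server-side filter are the contracts collected in the Reference. The function names are hypothetical and not part of Starbase.

```python
from typing import Mapping

# Assumption: the tenant key is a single label called "project".
# The real key is defined by the contracts in the Reference.
TENANT_KEY = "project"


def force_tenant_filter(user_matchers: Mapping[str, str],
                        session_tenant: str) -> dict[str, str]:
    """Return the user's matchers with the tenant label overwritten.

    Whatever the caller supplied for the tenant label is discarded, so the
    filter cannot be removed or widened from the client side.
    """
    if not session_tenant:
        raise PermissionError("no tenant bound to this session")
    merged = dict(user_matchers)
    merged[TENANT_KEY] = session_tenant  # the server-side value always wins
    return merged


def to_selector(matchers: Mapping[str, str]) -> str:
    """Render matchers as a Prometheus-style label selector string."""
    parts = []
    for key, value in sorted(matchers.items()):
        # Escape backslashes first, then double quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{key}="{escaped}"')
    return "{" + ",".join(parts) + "}"


# A caller asking for another tenant's data still gets its own tenant.
matchers = force_tenant_filter({"project": "someone-else", "service": "api"}, "acme")
selector = to_selector(matchers)
```

The design point is that the tenant value comes from the authenticated session, never from the request, so the stores can stay private behind the front-door while the user still narrows results by their own labels.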

The lifecycle in one breath

A pod is born → on the next timer tick its meters get read and its log file gets tailed (no one contacts the pod) → each signal is tagged with project · environment · service → shipped over the private network → stored (metrics in VictoriaMetrics, logs in ClickHouse) → the dashboard shows it through a filter the user can't remove. A pod dies → readings stop, series age out by retention, the log file disappears. Nothing to clean up.

#	Step	What happens
01	Born	Shuttle creates the pod, its labels, and its HTTPRoute.
02	Read & tailed	Next tick: vmagent scrapes kubelet + Envoy; Fluent Bit picks up the new log file.
03	Tagged	Identity attached — by join, by parse, or by label.
04	Shipped	Over the regional VPC (intra-VPC, $0) with a per-cluster token.
05	Stored	Metrics → VictoriaMetrics. Logs → ClickHouse.
06	Served	API forces a tenant filter; Stardeck renders it.
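Step 03 names three ways identity gets attached. As a rough sketch of two of them, "by label" and "by join", the Python below assumes label keys `starform.io/project`, `starform.io/environment`, and `starform.io/service`. Only the `starform.io/` prefix comes from this guide; the key names, the `Identity` type, and both functions are hypothetical. "By parse" is left out because the HTTPRoute route-name format is restated in Chapter 1, not here.

```python
from typing import NamedTuple, Optional

LABEL_PREFIX = "starform.io/"  # prefix from this guide; key names below are assumed


class Identity(NamedTuple):
    project: str
    environment: str
    service: str


def identity_from_labels(labels: dict[str, str]) -> Optional[Identity]:
    """'By label': read identity straight off the object's own labels."""
    try:
        return Identity(*(labels[LABEL_PREFIX + field] for field in Identity._fields))
    except KeyError:
        return None  # at least one identity label is missing


def tag_by_join(sample: dict,
                pod_index: dict[tuple[str, str], Identity]) -> Optional[dict]:
    """'By join': the sample only carries namespace + pod; look up the rest."""
    ident = pod_index.get((sample["namespace"], sample["pod"]))
    if ident is None:
        return None  # unknown pod: nothing to attribute the sample to
    return {**sample, **ident._asdict()}


pod_labels = {
    "starform.io/project": "acme",
    "starform.io/environment": "prod",
    "starform.io/service": "api",
}
ident = identity_from_labels(pod_labels)
index = {("ns1", "api-7d9f"): ident}
tagged = tag_by_join({"namespace": "ns1", "pod": "api-7d9f", "cpu": 0.2}, index)
```

In this sketch a sample that cannot be attributed is returned as `None` rather than shipped untagged; whether the real pipeline drops or quarantines such samples is not stated in this part.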

What's in, and what's a stub

Area	Status	Notes
Customer metrics — latency, RPS, throughput, error rate, CPU, memory, network	In MVP	Seven signals from two sources (Diagram 3).
Customer logs — runtime + build, live tail	In MVP	Fluent Bit → Vector aggregator → ClickHouse.
Platform self-monitoring — alerts on Starform's own components	In MVP	Grafana Cloud (external), fed by a dedicated Grafana Alloy agent.
Preview-env per-route L7 metrics	Stub	Previews still get logs + CPU/mem. Per-route latency/RPS handling (short retention vs aggregate vs suppress) is to be architected — see Open items.
Distributed tracing	Stub	Push model, needs app instrumentation. Future plane — sketched, not built.
Per-tier metric retention	Deferred	MVP = global 90d. Logs already tier cleanly.

Next: Chapter 1 · Architecture & components — namespaces, the HTTPRoute name, and the label set, with the three identity conventions everything below depends on.