Observability

How Starform collects, attributes, ships, stores, and serves customer telemetry — and how the platform watches itself. This is the self-contained build-and-operate guide for the engineer standing the pipeline up (§35).

Self-contained

You do not need anything else open to implement from this tab. The load-bearing contracts it depends on (the HTTPRoute route-name format, the starform.io/* label set, the tenant key, the server-side query filter) are restated inline in each chapter and collected in the Reference with provenance tags. Those contracts are owned by §20.2 / §24 / §35; those pages win on conflict. When a contract moves, update this page too, following the sync checklist.

This guide has four parts. Part A (this page) is the whole system in plain words. Part B (chapters 1–9) explains how to build each piece. Part C (Operate) is what to do when a piece breaks. Part D (Reference) carries FR/SC traceability and the inlined contracts. Scope: metrics and logs; tracing and preview-env L7 metrics are stubs.


Read this first — in plain words

If you read nothing else, read this part. It is the whole system with no jargon.

The components — what each piece is

Before the data-flow, here's the cast. Everything obeys one rule — Starbase decides, Shuttle applies; every change flows through desired state. The pieces are grouped by where they run.

[Diagram: component map. Starbase decides · Shuttle applies; everything flows through desired state.]
Control plane (one region at MVP): Starbase (control plane; decides what runs) · Stardeck (dashboard; Mission Control obs view) · Starforge (build; git push → image)
Each customer cluster (per region): Shuttle (applies desired state) · Envoy GW (L7 routing + metrics) · Fluent Bit (log agent) · vmagent (metrics scraper) · metrics-server + KSM · Grafana Alloy (→ Grafana Cloud)
Regional telemetry (DO VM droplets + external): Vector aggregator (regional log fan-in) · ClickHouse (customer logs; VM droplet) · VictoriaMetrics (customer metrics; VM droplet) · Grafana Cloud (platform monitoring; external)

The component map. Starform's pieces, grouped by where they run. Brand-blue = Starform-built · gray = third-party · darker = a data store · dashed = external SaaS. The telemetry path threads bottom-up — cluster collectors (Fluent Bit, vmagent) → regional stores → Stardeck.

The mental model

Monitoring is two separate pipes — one for metrics, one for logs — that never touch until they meet on the dashboard. Alongside them runs a third plane (Grafana Cloud, external) that watches Starform's own machinery for the SRE team.

Nobody "queries a new pod." Metrics are pulled on a timer, logs are tailed off the machine, and traces (later) are pushed by the app. Three signals, three habits.

[Diagram 1: VPC topology]
Regional VPC /16 (one per region)
  Customer cluster (ns = project): pods · Envoy GW · cAdvisor · kube-state-metrics · vmagent (metrics) · Fluent Bit (logs) · Grafana Alloy (platform)
  Ingest: intra-VPC, $0
  Regional telemetry (DO VM droplets): VictoriaMetrics (metrics) · Vector (aggregator) · ClickHouse (logs)
  Front-door: vmauth (read proxy → VictoriaMetrics; also ingest front-door) · read-only ClickHouse user (native RBAC; not a separate service)
Private VPC peering (read): 1 per region (~12 ≪ 50)
Control plane (central, own VPC): Starbase API (injects tenant filter, FR-065) · Stardeck (filtered view)
  MVP: control plane co-locates in one region (no cross-region read yet)
Grafana Cloud (external SaaS): platform self-monitoring, fed by Grafana Alloy

Diagram 1 — VPC topology. The regional VPC (one per region) holds the customer cluster and its telemetry VM stores, so ingest is intra-VPC at $0. The central control plane (its own VPC) peers in to read each region's stores — only through the authed front-door, in two parts: vmauth (a proxy deployment) for VictoriaMetrics, and a read-only ClickHouse user (native RBAC, not a service) for logs. The stores stay private; Starbase injects the tenant filter. One VPC + one peering per region (~12 ≪ 50). A dedicated Grafana Alloy agent ships platform series to Grafana Cloud (external) over the internet. Arrow colour: blue = metrics pipe · amber = logs pipe. At MVP the control plane is single-region (co-located), so there is no cross-region read yet.
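The "Starbase injects the tenant filter" step can be illustrated with a minimal sketch. It is not the Starbase implementation: the tenant label name `project`, the function name, and the selector syntax (a Prometheus-style label selector) are all assumptions made for illustration. The real tenant key and filter contract are the ones restated in the Reference. The point it shows is that the tenant matcher always comes from the server, so a caller cannot widen or replace it.

```python
import re

# Hypothetical tenant label name; the real tenant key is defined in the Reference.
TENANT_KEY = "project"

# Conservative pattern for label names, so user-supplied keys cannot smuggle in syntax.
_LABEL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _quote(value: str) -> str:
    """Escape backslashes and double quotes for a double-quoted label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def force_tenant_filter(user_matchers: dict[str, str], tenant: str) -> str:
    """Build a label selector whose tenant matcher is always set by the server."""
    for name in user_matchers:
        if not _LABEL_NAME.match(name):
            raise ValueError(f"invalid label name: {name!r}")
    matchers = dict(user_matchers)
    # Overwrite whatever the caller supplied for the tenant key.
    matchers[TENANT_KEY] = tenant
    parts = [f'{name}="{_quote(value)}"' for name, value in sorted(matchers.items())]
    return "{" + ",".join(parts) + "}"


# The caller asks for another tenant's data; the server-side value wins.
print(force_tenant_filter({"project": "someone-else", "environment": "prod"}, "acme"))
# prints {environment="prod",project="acme"}
```

Overwriting rather than appending matters here: adding a second matcher for the same key would depend on how the store resolves duplicates, whereas replacing it leaves exactly one tenant matcher in the selector.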

The lifecycle in one breath

A pod is born → on the next timer tick its meters get read and its log file gets tailed (no one contacts the pod) → each signal is tagged with project · environment · service → shipped over the private network → stored (metrics in VictoriaMetrics, logs in ClickHouse) → the dashboard shows it through a filter the user can't remove. A pod dies → readings stop, series age out by retention, the log file disappears. Nothing to clean up.

| Step | What happens |
| 01 Born | Shuttle creates the pod, its labels, and its HTTPRoute. |
| 02 Read & tailed | Next tick: vmagent scrapes kubelet + Envoy; Fluent Bit picks up the new log file. |
| 03 Tagged | Identity attached: by join, by parse, or by label. |
| 04 Shipped | Over the regional VPC (intra-VPC, $0) with a per-cluster token. |
| 05 Stored | Metrics → VictoriaMetrics. Logs → ClickHouse. |
| 06 Served | API forces a tenant filter; Stardeck renders it. |
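Steps 03 and 05 of the lifecycle can be sketched as a tiny tag-and-route function. This is a plain-Python model, not pipeline code: the `Identity` class, the `tag_and_route` name, and the record shape are hypothetical, and the real tagging happens inside vmagent, Fluent Bit, and Vector by join, parse, or label. The sketch only shows the two invariants from the lifecycle: every signal carries project · environment · service, and metrics and logs land in different stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """The three identity tags attached to every signal (step 03)."""
    project: str
    environment: str
    service: str


# Step 05: each signal kind has exactly one destination store.
STORES = {"metric": "VictoriaMetrics", "log": "ClickHouse"}


def tag_and_route(kind: str, payload: dict, identity: Identity) -> tuple[str, dict]:
    """Attach identity to a record and pick its store; identity wins over payload keys."""
    if kind not in STORES:
        # Traces are a stub in this guide, so only metrics and logs are accepted.
        raise ValueError(f"unsupported signal kind: {kind}")
    tagged = {
        **payload,
        "project": identity.project,
        "environment": identity.environment,
        "service": identity.service,
    }
    return STORES[kind], tagged


store, record = tag_and_route(
    "log",
    {"message": "listening on :8080", "project": "spoofed"},
    Identity(project="acme", environment="prod", service="api"),
)
print(store, record["project"])  # prints ClickHouse acme
```

Placing the identity keys after the payload in the merge means a value the application writes into its own log line cannot override the platform-assigned tags.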

What's in, and what's a stub

| Area | Status | Notes |
| Customer metrics: latency, RPS, throughput, error rate, CPU, memory, network | In MVP | Seven signals from two sources (Diagram 3). |
| Customer logs: runtime + build, live tail | In MVP | Fluent Bit → Vector aggregator → ClickHouse. |
| Platform self-monitoring: alerts on Starform's own components | In MVP | Grafana Cloud (external), fed by a dedicated Grafana Alloy agent. |
| Preview-env per-route L7 metrics | Stub | Previews still get logs + CPU/mem. Per-route latency/RPS handling (short retention vs aggregate vs suppress) is to be architected; see Open items. |
| Distributed tracing | Stub | Push model, needs app instrumentation. Future plane: sketched, not built. |
| Per-tier metric retention | Deferred | MVP = global 90d. Logs already tier cleanly. |

Next: Chapter 1 · Architecture & components — namespaces, the HTTPRoute name, and the label set, with the three identity conventions everything below depends on.