Shuttle Observability
§26.1 Shuttle Self-Observability
- /healthz: Liveness probe. 200 if Manager is running.
- /readyz: Readiness probe. 200 when Informer cache is synced and Starbase contacted at least once.
- /metrics: Prometheus endpoint. Standard controller-runtime metrics plus custom:
  - starform_snapshots_sent_total
  - starform_snapshots_failed_total
  - starform_capacity_reports_sent_total
  - starform_reconcile_errors_total
  - starform_starbase_request_duration_seconds
  - starform_informer_last_event_timestamp_seconds{resource}
  - starform_informer_watch_errors_total{resource}
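The readiness rule above is a conjunction of two conditions. A minimal sketch of that gate follows; the class name, field names, and the not-ready status code (503) are assumptions for illustration, since the source only specifies when 200 is returned.

```python
from dataclasses import dataclass


@dataclass
class ReadinessState:
    """Tracks the two conditions /readyz depends on (hypothetical names)."""
    informer_cache_synced: bool = False
    starbase_contacted_once: bool = False

    def ready(self) -> bool:
        # Both conditions must hold; either one alone is not enough.
        return self.informer_cache_synced and self.starbase_contacted_once


def readyz_status(state: ReadinessState, not_ready_code: int = 503) -> int:
    """Return the HTTP status for /readyz.

    200 comes from the text above. The not-ready code is an assumption:
    the source does not say which non-200 status is returned.
    """
    return 200 if state.ready() else not_ready_code


if __name__ == "__main__":
    state = ReadinessState()
    print(readyz_status(state))   # cache not synced, Starbase not contacted
    state.informer_cache_synced = True
    print(readyz_status(state))   # cache synced, Starbase still not contacted
    state.starbase_contacted_once = True
    print(readyz_status(state))   # both conditions met
```

Liveness is deliberately not part of this gate: per the text, /healthz only reports whether the Manager is running.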
Why starform_informer_last_event_timestamp_seconds matters: the Snapshot Runnable derives
billing from the Informer cache. If a watch stream silently dies and fails to reconnect cleanly, the
cache freezes and Shuttle bills off a stale view of reality without any reconcile error. This metric
is a per-resource-type gauge of the unix timestamp of the last received watch event; alerts fire on
time() - starform_informer_last_event_timestamp_seconds{resource="pods"} > 300 (seconds). Pairs with
starform_informer_watch_errors_total to distinguish silent staleness from flapping reconnects.
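The way the two metrics combine can be sketched as a small classifier. This is an illustration of the reasoning, not the actual alert rule: the function name, the returned strings, and the idea of feeding in the error counter's increase over a lookback window are assumptions. Only the 300-second threshold and the metric names come from the text.

```python
STALENESS_THRESHOLD_SECONDS = 300  # threshold from the alert condition above


def classify_informer(now: float, last_event_ts: float, watch_errors_delta: int,
                      threshold: float = STALENESS_THRESHOLD_SECONDS) -> str:
    """Classify one resource type's informer state.

    now                -- current unix time (what PromQL's time() returns)
    last_event_ts      -- starform_informer_last_event_timestamp_seconds value
    watch_errors_delta -- increase of starform_informer_watch_errors_total
                          over a recent window (window length is an assumption)
    """
    stale = (now - last_event_ts) > threshold
    if not stale:
        return "fresh"
    if watch_errors_delta > 0:
        # Stale AND the watch is erroring: reconnect attempts are failing.
        return "flapping_reconnects"
    # Stale with no watch errors: the stream died without reporting anything.
    return "silent_staleness"


if __name__ == "__main__":
    now = 1_700_000_000.0
    print(classify_informer(now, now - 30, 0))
    print(classify_informer(now, now - 900, 0))
    print(classify_informer(now, now - 900, 4))
```

The silent case is the one the timestamp gauge exists for: with no watch errors and no reconcile errors, nothing else would signal that the cache has frozen.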
§26.2 Customer Workload Observability
Shuttle does NOT ship customer logs or metrics. Autonomous tools handle this:
| Tool | Type | Role |
|---|---|---|
| Fluent Bit | DaemonSet | Tails /var/log/containers/*.log, enriches with pod labels, forwards to the regional Vector aggregator → ClickHouse |
| vmagent | Deployment | Scrapes three sources — Envoy Gateway stats (latency, RPS, throughput, error rate), kubelet/cAdvisor (CPU, memory, network), and kube-state-metrics — promotes identity labels, and remote_writes to VictoriaMetrics |
Log flow: Customer Pod stdout → Node filesystem → Fluent Bit (agent) → regional Vector
aggregator → ClickHouse (partitioned by project_id, scoped further by environment +
service_id)
Metric flow: Envoy stats + cAdvisor + kube-state-metrics → vmagent → VictoriaMetrics (labeled
with project_id, environment, service_id; workspace_id carried for billing-boundary
queries). See §35.2 for sources,
attribution, and cardinality discipline.
Shuttle's only role: apply the standard label set (§24) to every customer resource. Fluent Bit and vmagent use these labels automatically.
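A rough sketch of what applying the label set to a resource could look like. The real label keys are defined in §24, which is not reproduced here; the keys below reuse the identity fields named in the log and metric flows and are assumptions, as is the function name.

```python
import copy

# Assumed keys, taken from the identity fields named in the flows above.
# The authoritative label set is defined in §24.
REQUIRED_IDENTITY_KEYS = ("workspace_id", "project_id", "environment", "service_id")


def apply_standard_labels(resource: dict, identity: dict) -> dict:
    """Return a copy of a Kubernetes-style resource with identity labels set.

    Refuses to label a resource when any identity field is missing, so that
    no customer resource is emitted with a partial label set.
    """
    missing = [k for k in REQUIRED_IDENTITY_KEYS if k not in identity]
    if missing:
        raise ValueError(f"missing identity labels: {missing}")

    labeled = copy.deepcopy(resource)  # never mutate the caller's object
    labels = labeled.setdefault("metadata", {}).setdefault("labels", {})
    for key in REQUIRED_IDENTITY_KEYS:
        labels[key] = identity[key]
    return labeled


if __name__ == "__main__":
    deployment = {"kind": "Deployment", "metadata": {"name": "web", "labels": {"app": "web"}}}
    identity = {"workspace_id": "ws-1", "project_id": "proj-1",
                "environment": "production", "service_id": "svc-1"}
    print(apply_standard_labels(deployment, identity)["metadata"]["labels"])
```

Because the agents read labels from the resources themselves, this is the only coupling point between Shuttle and the telemetry pipeline.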
§26.3 Who Installs Fluent Bit, vmagent, and Alloy
Starbase Worker installs the in-cluster agents during cluster bootstrap via Helm, before Shuttle is installed. The provisioning sequence:
- Starbase creates DOKS cluster via DO API
- Starbase installs Envoy Gateway via Helm
- Starbase installs metrics-server (DOKS does not install it by default; HPA and kubectl top are non-functional without it) and kube-state-metrics via Helm
- Starbase installs Fluent Bit, vmagent, and Grafana Alloy via Helm (Fluent Bit: tail + kubernetes enrich + forward to the regional Vector aggregator; vmagent: relabel configs + a single remote_write to the regional VictoriaMetrics, Envoy + cAdvisor + kube-state-metrics scrape targets; Grafana Alloy: scrapes Starform-component series and remote_writes them to Grafana Cloud for platform self-monitoring, §35.5)
- Starbase installs Shuttle (deploy/ manifests)
- Starbase registers cluster as active
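The ordering constraint in the sequence above (agents before Shuttle, registration last) can be sketched as a simple ordered runner. The step strings paraphrase the list; the function name, the callback signature, and the stop-on-first-failure behavior are assumptions, not a description of Starbase Worker's actual code.

```python
from typing import Callable, List, Tuple

# Steps in the order given by the provisioning sequence above.
BOOTSTRAP_STEPS: List[str] = [
    "create DOKS cluster",
    "install Envoy Gateway",
    "install metrics-server and kube-state-metrics",
    "install Fluent Bit, vmagent, and Grafana Alloy",
    "install Shuttle",
    "register cluster as active",
]


def run_bootstrap(run_step: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Run each step in order and stop at the first failure (assumed policy).

    run_step is a hypothetical callback that performs one step and returns
    True on success. Returns the completed steps and whether all succeeded.
    """
    completed: List[str] = []
    for step in BOOTSTRAP_STEPS:
        if not run_step(step):
            return completed, False
        completed.append(step)
    return completed, True


if __name__ == "__main__":
    # Simulate the agent install failing: Shuttle is never installed and
    # the cluster is never registered as active.
    done, ok = run_bootstrap(lambda step: "Fluent Bit" not in step)
    print(ok, done)
```

Under this policy a cluster is only registered as active once the telemetry agents and Shuttle are all in place.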
Shuttle contains no Helm logic, no Fluent Bit config, no vmagent config, no Alloy config.
Regional telemetry stores are not part of cluster bootstrap. The per-region ClickHouse and VictoriaMetrics VM droplets and the Vector aggregator are provisioned out-of-band (Terraform + cloud-init/systemd), once per region, independent of the K8s cluster lifecycle — a cluster can be rebuilt without touching telemetry data. Backups are DO volume snapshots (SOC2, §39.2).