Shuttle Observability

§26.1 Shuttle Self-Observability

  • /healthz — Liveness probe. 200 if Manager is running.
  • /readyz — Readiness probe. 200 when Informer cache is synced and Starbase contacted at least once.
  • /metrics — Prometheus endpoint. Standard controller-runtime metrics plus custom: starform_snapshots_sent_total, starform_snapshots_failed_total, starform_capacity_reports_sent_total, starform_reconcile_errors_total, starform_starbase_request_duration_seconds, starform_informer_last_event_timestamp_seconds{resource}, starform_informer_watch_errors_total{resource}.
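The two probe conditions above can be sketched as pure functions. This is a minimal illustration, not Shuttle's actual handler code: the ProbeState fields are hypothetical names, and the 503 returned when a check fails is an assumption (the text only specifies when 200 is returned).

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class ProbeState:
    """Hypothetical snapshot of the conditions the two probes inspect."""
    manager_running: bool = False
    informer_synced: bool = False
    starbase_contacts: int = 0  # successful Starbase round-trips so far


def healthz(state: ProbeState) -> HTTPStatus:
    # Liveness only asks whether the Manager is running.
    if state.manager_running:
        return HTTPStatus.OK
    return HTTPStatus.SERVICE_UNAVAILABLE  # assumed failure code


def readyz(state: ProbeState) -> HTTPStatus:
    # Readiness needs BOTH: Informer cache synced and Starbase
    # contacted at least once.
    if state.informer_synced and state.starbase_contacts >= 1:
        return HTTPStatus.OK
    return HTTPStatus.SERVICE_UNAVAILABLE  # assumed failure code


if __name__ == "__main__":
    booting = ProbeState(manager_running=True)
    # A freshly started Shuttle is live but not yet ready.
    print(healthz(booting).value, readyz(booting).value)
```

Keeping the two checks separate means a Shuttle that is still syncing its cache is taken out of rotation without being restarted.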

Why starform_informer_last_event_timestamp_seconds matters: the Snapshot Runnable derives billing from the Informer cache. If a watch stream dies silently and does not reconnect cleanly, the cache freezes and Shuttle bills from a stale view of the cluster without raising any reconcile error. The metric is a gauge, one series per resource type, holding the Unix timestamp of the last watch event received. Alerts fire when time() - starform_informer_last_event_timestamp_seconds{resource="pods"} > 300 (the difference is in seconds). Pair it with starform_informer_watch_errors_total to distinguish silent staleness from flapping reconnects.
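The pairing of the two metrics can be sketched as a small classifier. This is one possible reading of the paragraph above, not the platform's alerting code: the function name, the WatchHealth states, and the rule "stale with rising errors means flapping, stale with flat errors means silent" are assumptions made for illustration. Only the 300-second threshold comes from the alert expression.

```python
from enum import Enum

STALE_AFTER_SECONDS = 300  # threshold taken from the alert expression


class WatchHealth(Enum):
    HEALTHY = "healthy"
    FLAPPING = "flapping"          # stale, and watch errors are increasing
    SILENT_STALE = "silent_stale"  # stale, with no watch errors recorded


def classify(now, last_event_ts, errors_before, errors_now,
             stale_after=STALE_AFTER_SECONDS):
    """Classify one resource type's watch from two samples.

    now / last_event_ts are Unix timestamps (the gauge value);
    errors_before / errors_now are two readings of the error counter.
    """
    if (now - last_event_ts) <= stale_after:
        return WatchHealth.HEALTHY
    if errors_now > errors_before:
        # The watch is failing loudly: reconnect attempts are erroring.
        return WatchHealth.FLAPPING
    # No events and no errors: the dangerous case for billing.
    return WatchHealth.SILENT_STALE


if __name__ == "__main__":
    print(classify(now=1000, last_event_ts=650, errors_before=4, errors_now=4))
```

The silent case is the one the timestamp gauge exists for, since neither the error counter nor reconcile errors would reveal it.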

§26.2 Customer Workload Observability

Shuttle does NOT ship customer logs or metrics. Two independent in-cluster agents handle this:

Tool | Type | Role
Fluent Bit | DaemonSet | Tails /var/log/containers/*.log, enriches with pod labels, forwards to the regional Vector aggregator → ClickHouse
vmagent | Deployment | Scrapes three sources: Envoy Gateway stats (latency, RPS, throughput, error rate), kubelet/cAdvisor (CPU, memory, network), and kube-state-metrics. Promotes identity labels and remote_writes to VictoriaMetrics

Log flow: Customer Pod stdout → Node filesystem → Fluent Bit (agent) → regional Vector aggregator → ClickHouse (partitioned by project_id, scoped further by environment + service_id)

Metric flow: Envoy stats + cAdvisor + kube-state-metrics → vmagent → VictoriaMetrics (labeled with project_id, environment, service_id; workspace_id carried for billing-boundary queries). See §35.2 for sources, attribution, and cardinality discipline.

Shuttle's only role: apply the standard label set (§24) to every customer resource. Fluent Bit and vmagent use these labels automatically.
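The division of labor can be illustrated with a sketch of label promotion: Shuttle stamps identity labels on a resource, and a downstream agent copies them onto each log line or series. The label key prefix below is hypothetical (the real keys are defined in §24), and the helper names are invented for this example. Only the four identity names come from the flows described above.

```python
from typing import Dict, List

# Identity dimensions named in the log and metric flows.
IDENTITY_LABELS = ("project_id", "environment", "service_id", "workspace_id")

# Hypothetical key prefix; the actual label keys are specified in §24.
LABEL_PREFIX = "starform.example/"


def promote_identity_labels(pod_labels: Dict[str, str]) -> Dict[str, str]:
    """Copy identity labels from a pod's label set onto a telemetry record."""
    promoted = {}
    for name in IDENTITY_LABELS:
        value = pod_labels.get(LABEL_PREFIX + name)
        if value is not None:
            promoted[name] = value
    return promoted


def missing_identity_labels(pod_labels: Dict[str, str]) -> List[str]:
    """List identity labels a resource lacks (telemetry could not be attributed)."""
    return [n for n in IDENTITY_LABELS if LABEL_PREFIX + n not in pod_labels]


if __name__ == "__main__":
    pod = {
        "app": "api",
        LABEL_PREFIX + "project_id": "p-1",
        LABEL_PREFIX + "environment": "prod",
        LABEL_PREFIX + "service_id": "s-9",
    }
    print(promote_identity_labels(pod))
    print(missing_identity_labels(pod))
```

A resource missing any of these labels would produce telemetry that cannot be attributed, which is why the labeling responsibility sits with Shuttle rather than with the agents.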

§26.3 Who Installs Fluent Bit, vmagent, and Alloy

Starbase Worker installs the in-cluster agents during cluster bootstrap via Helm, before Shuttle is installed. The provisioning sequence:

  1. Starbase creates DOKS cluster via DO API
  2. Starbase installs Envoy Gateway via Helm
  3. Starbase installs metrics-server (DOKS does not install it by default; HPA and kubectl top are non-functional without it) and kube-state-metrics via Helm
  4. Starbase installs Fluent Bit, vmagent, and Grafana Alloy via Helm (Fluent Bit: tail + kubernetes enrich + forward to the regional Vector aggregator; vmagent: relabel configs + a single remote_write to the regional VictoriaMetrics, Envoy + cAdvisor + kube-state-metrics scrape targets; Grafana Alloy: scrapes Starform-component series and remote_writes them to Grafana Cloud for platform self-monitoring, §35.5)
  5. Starbase installs Shuttle (deploy/ manifests)
  6. Starbase registers cluster as active
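The ordering constraint in this sequence (agents first, then Shuttle, then activation) can be sketched as a fail-fast step runner. This is an illustrative model only: the step names and the run_bootstrap helper are hypothetical, the actions are placeholders rather than real DO API or Helm calls, and stopping at the first failure is an assumption about how Starbase behaves.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

# Step names are hypothetical labels for the provisioning sequence.
BOOTSTRAP_ORDER = [
    "create-doks-cluster",
    "install-envoy-gateway",
    "install-metrics-server-and-kube-state-metrics",
    "install-fluent-bit-vmagent-alloy",
    "install-shuttle",
    "register-cluster-active",
]


def run_bootstrap(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of completed steps; raises RuntimeError naming the
    failed step so a cluster is never registered as active half-built.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"bootstrap stopped at {name!r}; completed: {completed}"
            ) from exc
        completed.append(name)
    return completed


def noop() -> None:
    """Placeholder for a real provisioning action."""


if __name__ == "__main__":
    print(run_bootstrap([(name, noop) for name in BOOTSTRAP_ORDER]))
```

Because the agents are installed before Shuttle, the first customer resources Shuttle labels are already being tailed and scraped.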

Shuttle contains no Helm logic, no Fluent Bit config, no vmagent config, no Alloy config.

Regional telemetry stores are not part of cluster bootstrap. The per-region ClickHouse and VictoriaMetrics VM droplets and the Vector aggregator are provisioned out-of-band (Terraform + cloud-init/systemd), once per region, independent of the K8s cluster lifecycle — a cluster can be rebuilt without touching telemetry data. Backups are DO volume snapshots (SOC2, §39.2).


Cross-references

Metric sources, attribution & cardinality → §35.2 · platform self-monitoring via Grafana Alloy → §35.5 · the labels Fluent Bit / vmagent consume → §24.1 · billing derived from snapshots → Billing · the Snapshot Runnable behind the billing feed → §19.3.