Shuttle Architecture
§19.1 Deployment Topology
- One Deployment per cluster, 1 replica
- Dedicated namespace: starform-system
- Dedicated ServiceAccount with scoped RBAC (see §22)
- Exposes /healthz and /metrics on port 8080 (handled by controller-runtime Manager)
- Outbound HTTPS to Starbase API for all communication
§19.2 Internal Component Structure
Shuttle uses sigs.k8s.io/controller-runtime as a library, not a framework. The Manager handles
lifecycle, signal handling, the Informer cache, the cached client, and the metrics endpoint. On top
of that, Shuttle runs three Runnables — each is a goroutine on a ticker that the Manager starts and
gracefully stops.
Shuttle does not use the reconcile.Reconciler interface, the For(...).Complete(r) builder, or
the K8s-event-driven workqueue. Those exist to drive reconciliation from K8s API events, but
Shuttle's trigger is Starbase (via HTTP polling), not K8s events. The Informer cache is still used,
for fast reads and an up-to-date view of cluster state for drift detection, but it's accessed
through the cached client during ticker-driven sweeps, not as an event source.
flowchart TB
classDef built fill:#3434DC22,stroke:#3434DC,color:#5B5EE8;
classDef third fill:transparent,stroke:#808080,color:#808080;
subgraph BIN["Shuttle Binary"]
direction TB
MGR["controller-runtime Manager<br/>lifecycle · signals · Informer cache<br/>cached client · /healthz · /metrics"]
subgraph RUN["Runnables"]
direction LR
DSR["Desired State<br/>Runnable"]
SNR["Snapshot<br/>Runnable"]
CAR["Capacity<br/>Runnable"]
end
HTTP["Starbase HTTP Client<br/>(shared)"]
MGR --> DSR
MGR --> SNR
MGR --> CAR
DSR --> HTTP
SNR --> HTTP
CAR --> HTTP
end
HTTP -. "HTTPS outbound only" .-> API["Starbase API"]
class MGR,DSR,SNR,CAR,HTTP,API built;
All three Runnables implement the manager.Runnable interface (a single
Start(ctx context.Context) error method) and are registered with the Manager via mgr.Add(...).
The Manager starts them when Shuttle boots and stops them gracefully on SIGTERM.
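As a rough illustration of the ticker-driven Runnable pattern described above, here is a minimal sketch using only the standard library. The local `Runnable` interface mirrors the single-method `manager.Runnable` shape; `TickerRunnable` and `runTicks` are hypothetical names, and cancelling the context stands in for the Manager's shutdown handling. In Shuttle itself the value would presumably be registered via `mgr.Add(...)` instead.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Runnable mirrors controller-runtime's manager.Runnable (a single Start
// method) so this sketch compiles with the standard library alone.
type Runnable interface {
	Start(ctx context.Context) error
}

// TickerRunnable (hypothetical) calls Tick once per Interval until the
// context is cancelled.
type TickerRunnable struct {
	Interval time.Duration
	Tick     func(ctx context.Context)
}

// Start blocks until ctx is cancelled, which is how shutdown is signalled.
func (r *TickerRunnable) Start(ctx context.Context) error {
	ticker := time.NewTicker(r.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// Re-check in case both channels were ready at once.
			if ctx.Err() != nil {
				return nil
			}
			r.Tick(ctx)
		}
	}
}

// runTicks runs a TickerRunnable until it has ticked n times (n >= 1),
// then cancels the context, standing in for the Manager's stop signal.
func runTicks(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	count := 0
	var r Runnable = &TickerRunnable{
		Interval: time.Millisecond, // 30s or 60s in the real Runnables
		Tick: func(context.Context) {
			count++
			if count == n {
				cancel()
			}
		},
	}
	_ = r.Start(ctx)
	return count
}

func main() {
	fmt.Println("ticks before shutdown:", runTicks(3))
}
```

Returning `nil` on cancellation treats shutdown as a normal exit rather than an error, which seems the natural choice for a loop that is expected to run for the life of the process.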
§19.3 Components
Desired State Runnable is Shuttle's "reconciler" — but it is NOT a reconcile.Reconciler. It's
a manager.Runnable that runs a 30-second ticker loop. On each tick, it:
- Pulls the current desired state for this cluster from Starbase via HTTP
- Reads the current Kubernetes state via the Manager's cached client (backed by the Informer)
- Diffs desired vs actual
- Applies creates/updates/deletes for the per-customer resource set
- Applies cluster infrastructure config (e.g., LB size unit annotation) if changed
The runnable is level-driven — it compares full desired vs actual state on every tick, not individual events. Drift detection is automatic: if a managed resource is deleted or modified outside Shuttle, the next tick (within 30 seconds) detects the discrepancy and reapplies the desired state.
The apply logic is structured as a sequence of sub-applies, each responsible for a specific resource type. All sub-applies are idempotent.
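The level-driven diff step can be sketched as a pure function over two maps. This is an illustration under assumptions, not Shuttle's actual apply logic: resources are reduced to a name mapped to an opaque spec fingerprint, and `Plan` and `diff` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists the resource names that need each kind of change.
type Plan struct {
	Create, Update, Delete []string
}

// diff compares full desired state against full actual state. Both maps are
// keyed by resource name; values are opaque spec fingerprints (assumed).
func diff(desired, actual map[string]string) Plan {
	var p Plan
	for name, want := range desired {
		got, exists := actual[name]
		switch {
		case !exists:
			p.Create = append(p.Create, name)
		case got != want:
			p.Update = append(p.Update, name)
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			p.Delete = append(p.Delete, name)
		}
	}
	// Map iteration order is random; sort for deterministic output.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Delete)
	return p
}

func main() {
	desired := map[string]string{"svc-a": "v2", "svc-b": "v1"}
	actual := map[string]string{"svc-b": "v0", "svc-c": "v1"}
	fmt.Printf("%+v\n", diff(desired, actual))
}
```

Because the function looks only at the two full states, running it again after the plan has been applied yields an empty plan, which is the idempotency property the sub-applies rely on.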
Snapshot Runnable implements manager.Runnable. Every 60 seconds, it walks the Informer cache,
builds a list of snapshot records for all pods labeled starform.io/managed-by=shuttle, and POSTs
the batch to Starbase. Each snapshot contains workspace_id (billing boundary), project_id,
environment, service_id, pod_id, tier, phase, cluster_id, and snapshot_timestamp. A
deterministic snapshot_id is computed as a hash of
project_id + service_id + pod_id + snapshot_timestamp for idempotent ingestion. (The stale
customer_id field was removed in v1.9; identity is now the project_id + environment +
service_id tuple, with workspace_id as the billing label.)
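One way the deterministic snapshot_id could be computed is sketched below. The text only says "a hash" of the four fields, so SHA-256, the NUL field separator, and the Unix-seconds timestamp encoding are all assumptions made for this example; `snapshotID` is a hypothetical function name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// snapshotID hashes project_id, service_id, pod_id, and snapshot_timestamp.
// SHA-256 and the Unix-seconds timestamp encoding are assumptions.
func snapshotID(projectID, serviceID, podID string, snapshotUnix int64) string {
	h := sha256.New()
	parts := []string{projectID, serviceID, podID, strconv.FormatInt(snapshotUnix, 10)}
	for _, part := range parts {
		h.Write([]byte(part))
		// A NUL separator keeps ("ab","c") distinct from ("a","bc").
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	id := snapshotID("proj-1", "svc-1", "pod-abc", 1700000000)
	again := snapshotID("proj-1", "svc-1", "pod-abc", 1700000000)
	fmt.Println("stable across retries:", id == again)
}
```

Since the same inputs always yield the same ID, a batch that is retried after a failed POST can be deduplicated on the receiving side.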
Capacity Runnable also implements manager.Runnable. Every 60 seconds, it counts customer pods
in the cache (filtered by label) and POSTs a capacity report to Starbase containing cluster_id,
current_pod_count, soft_limit, and timestamp.
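The capacity report might look like the following sketch. The JSON field names come from the description above; the `Pod` stand-in type, the `countManaged` helper, the RFC 3339 timestamp encoding (Go's default for `time.Time`), and the example values are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

const managedByLabel = "starform.io/managed-by"

// Pod is a pared-down stand-in for a Kubernetes Pod object.
type Pod struct {
	Labels map[string]string
}

// CapacityReport carries the four fields named in the text.
type CapacityReport struct {
	ClusterID       string    `json:"cluster_id"`
	CurrentPodCount int       `json:"current_pod_count"`
	SoftLimit       int       `json:"soft_limit"`
	Timestamp       time.Time `json:"timestamp"`
}

// countManaged counts pods carrying the managed-by=shuttle label.
func countManaged(pods []Pod) int {
	n := 0
	for _, p := range pods {
		if p.Labels[managedByLabel] == "shuttle" {
			n++
		}
	}
	return n
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{managedByLabel: "shuttle"}},
		{Labels: map[string]string{"app": "unrelated"}},
	}
	report := CapacityReport{
		ClusterID:       "cluster-example", // placeholder value
		CurrentPodCount: countManaged(pods),
		SoftLimit:       500, // placeholder value
		Timestamp:       time.Now().UTC(),
	}
	body, err := json.Marshal(report)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```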
Starbase HTTP Client is a thin wrapper around net/http with:
- Bearer token authentication (token loaded from a Kubernetes Secret at startup)
- Retry with exponential backoff (3 attempts)
- Timeout configuration (30s default)
- Shared http.Client with connection pooling
- JSON request/response encoding
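The client features listed above could be combined roughly as follows. This is a sketch under assumptions: the `Client` type, `PostJSON`, `backoffDelay`, and the `/v1/capacity` path are hypothetical, and retrying only on transport errors and 5xx responses is a choice made for this example rather than something the text specifies. `demoAttempts` exercises the client against a local test server that fails twice before succeeding.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Client is a hypothetical shape for the shared Starbase client.
type Client struct {
	BaseURL     string
	Token       string       // bearer token, loaded once at startup
	HTTP        *http.Client // shared so connections are pooled
	MaxAttempts int
	BaseDelay   time.Duration
}

// backoffDelay doubles the wait on each retry: base, 2*base, 4*base, ...
func backoffDelay(retry int, base time.Duration) time.Duration {
	return base * time.Duration(1<<retry)
}

// PostJSON encodes body as JSON and POSTs it, retrying with backoff.
func (c *Client) PostJSON(ctx context.Context, path string, body any) error {
	payload, err := json.Marshal(body)
	if err != nil {
		return err
	}
	var lastErr error
	for attempt := 0; attempt < c.MaxAttempts; attempt++ {
		if attempt > 0 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoffDelay(attempt-1, c.BaseDelay)):
			}
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.BaseURL+path, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		req.Header.Set("Authorization", "Bearer "+c.Token)
		req.Header.Set("Content-Type", "application/json")
		resp, err := c.HTTP.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		// Drain and close the body so the connection can be reused.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode < 400 {
			return nil
		}
		lastErr = fmt.Errorf("starbase returned %s", resp.Status)
		if resp.StatusCode < 500 {
			return lastErr // client errors will not succeed on retry
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", c.MaxAttempts, lastErr)
}

// demoAttempts returns how many requests a flaky test server received.
func demoAttempts() (int, error) {
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	c := &Client{
		BaseURL: srv.URL, Token: "example-token",
		HTTP:        &http.Client{Timeout: 30 * time.Second},
		MaxAttempts: 3, BaseDelay: time.Millisecond,
	}
	err := c.PostJSON(context.Background(), "/v1/capacity", map[string]int{"current_pod_count": 1})
	return int(atomic.LoadInt32(&calls)), err
}

func main() {
	n, err := demoAttempts()
	fmt.Println("attempts:", n, "error:", err)
}
```

Reusing a single `http.Client` across the three Runnables is what makes connection pooling effective, since the pool lives on the client's transport.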
§19.4 What Lives Where
- Informer cache: lives in Shuttle's process memory, managed by controller-runtime. Filtered by label selector starform.io/managed-by=shuttle. Rebuilt from scratch on restart.
- Desired state: Starbase's responsibility. Shuttle never persists desired state locally; it pulls fresh on each tick.
- Snapshot data: streamed to Starbase as produced. No local buffering for MVP. If a POST still fails after the client's 3 attempts, the batch is dropped (loss bounded to one 60-second snapshot interval per cluster).
§19.5 Design Rationale
Why controller-runtime as a library, not a framework?
The standard Kubernetes operator pattern (Reconciler + For(...).Complete(r) + workqueue) is built
for operators whose primary input is K8s API events. Shuttle's primary input is the Starbase HTTP
API. There's no K8s object whose change should trigger a reconcile. Using manager.Runnable
directly is cleaner: the ticker IS the trigger, the runnable's Start(ctx) IS the loop.
Why still use controller-runtime at all?
- The Manager provides production-grade lifecycle management: graceful shutdown, metrics endpoint, health probes, leader election when needed later.
- The Informer cache gives fast reads and automatic drift detection — same infrastructure Render uses.
- The cached client unifies reads from cache with writes to the API server.
Reference architecture: Plural's agentk (github.com/pluralsh/kubernetes-agent) — same pattern: lightweight agent in customer clusters, outbound connection to central control plane, pulls instructions, reports state. Plural uses bidirectional gRPC streaming; we can upgrade from HTTP polling post-MVP.