
Shuttle Architecture

§19.1 Deployment Topology

  • One Deployment per cluster, 1 replica
  • Dedicated namespace: starform-system
  • Dedicated ServiceAccount with scoped RBAC (see §22)
  • Exposes /healthz and /metrics on port 8080 (handled by controller-runtime Manager)
  • Outbound HTTPS to Starbase API for all communication

§19.2 Internal Component Structure

Shuttle uses sigs.k8s.io/controller-runtime as a library, not a framework. The Manager handles lifecycle, signal handling, the Informer cache, the cached client, and the metrics endpoint. On top of that, Shuttle runs three Runnables — each is a goroutine on a ticker that the Manager starts and gracefully stops.

Shuttle does not use the reconcile.Reconciler interface, the For(...).Complete(r) builder, or the K8s-event-driven workqueue. Those exist to drive reconciliation from K8s API events, but Shuttle's trigger is Starbase (via HTTP polling), not K8s events. The Informer cache is still used, both for fast reads and to keep an up-to-date view of cluster state for drift detection, but it is accessed through the cached client during ticker-driven sweeps, not as an event source. Drift is therefore detected on the next tick, not instantly.

flowchart TB
  classDef built fill:#3434DC22,stroke:#3434DC,color:#5B5EE8;
  classDef third fill:transparent,stroke:#808080,color:#808080;

  subgraph BIN["Shuttle Binary"]
    direction TB
    MGR["controller-runtime Manager<br/>lifecycle · signals · Informer cache<br/>cached client · /healthz · /metrics"]
    subgraph RUN["Runnables"]
      direction LR
      DSR["Desired State<br/>Runnable"]
      SNR["Snapshot<br/>Runnable"]
      CAR["Capacity<br/>Runnable"]
    end
    HTTP["Starbase HTTP Client<br/>(shared)"]
    MGR --> DSR
    MGR --> SNR
    MGR --> CAR
    DSR --> HTTP
    SNR --> HTTP
    CAR --> HTTP
  end
  HTTP -. "HTTPS outbound only" .-> API["Starbase API"]

  class MGR,DSR,SNR,CAR,HTTP,API built;
Diagram — Shuttle internal component structure. The controller-runtime Manager owns lifecycle and the Informer cache; three ticker-driven Runnables sit on top and share one outbound Starbase HTTP client. Everything is Starform-built (brand blue). Outbound traffic is HTTPS only.

All three Runnables implement the manager.Runnable interface (a single Start(ctx context.Context) error method) and are registered with the Manager via mgr.Add(...). The Manager starts them when Shuttle boots and stops them gracefully on SIGTERM.
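A minimal sketch of the ticker-driven Runnable shape, written against the standard library only. The local Runnable interface mirrors the single-method manager.Runnable described above; tickerRunnable, its sweep callback, countSweeps, and the log-and-continue error policy are illustrative assumptions, not Shuttle's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Runnable mirrors the single-method shape of controller-runtime's
// manager.Runnable so this sketch compiles with the standard library alone.
type Runnable interface {
	Start(ctx context.Context) error
}

// tickerRunnable (hypothetical) calls sweep once per interval until the
// context is cancelled.
type tickerRunnable struct {
	interval time.Duration
	sweep    func(ctx context.Context) error
}

// Start blocks until ctx is cancelled, which is how shutdown is signalled.
// Returning nil on cancellation marks a clean stop.
func (r *tickerRunnable) Start(ctx context.Context) error {
	t := time.NewTicker(r.interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-t.C:
			if err := r.sweep(ctx); err != nil {
				// Assumption: a failed sweep is logged and the loop waits
				// for the next tick instead of exiting.
				fmt.Println("sweep failed:", err)
			}
		}
	}
}

// countSweeps stands in for the Manager: it runs a Runnable for a fixed
// duration and reports how many sweeps executed.
func countSweeps(interval, runFor time.Duration) int {
	n := 0
	var r Runnable = &tickerRunnable{
		interval: interval,
		sweep:    func(context.Context) error { n++; return nil },
	}
	ctx, cancel := context.WithTimeout(context.Background(), runFor)
	defer cancel()
	_ = r.Start(ctx)
	return n
}

func main() {
	fmt.Println("sweeps:", countSweeps(10*time.Millisecond, 55*time.Millisecond))
}
```

In the real binary, a value of this shape would be passed to mgr.Add(...), and the Manager would supply the context that is cancelled on SIGTERM.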

§19.3 Components

Desired State Runnable is Shuttle's "reconciler" — but it is NOT a reconcile.Reconciler. It's a manager.Runnable that runs a 30-second ticker loop. On each tick, it:

  1. Pulls the current desired state for this cluster from Starbase via HTTP
  2. Reads the current Kubernetes state via the Manager's cached client (backed by the Informer)
  3. Diffs desired vs actual
  4. Applies creates/updates/deletes for the per-customer resource set
  5. Applies cluster infrastructure config (e.g., LB size unit annotation) if changed

The runnable is level-driven — it compares full desired vs actual state on every tick, not individual events. Drift detection is automatic: if a managed resource is deleted or modified outside Shuttle, the next tick (within 30 seconds) detects the discrepancy and reapplies the desired state.

The apply logic is structured as a sequence of sub-applies, each responsible for a specific resource type. All sub-applies are idempotent.
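Steps 3 and 4 of the tick can be sketched as a pure diff over two maps. The representation here (resource key mapped to a content hash) and the Plan type are assumptions made for illustration; the real sub-applies work on Kubernetes objects.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists the resource keys that need each kind of write.
type Plan struct {
	Create, Update, Delete []string
}

// diff compares desired and actual state. Both maps are keyed by a resource
// identifier and valued by a content hash (hypothetical representation).
func diff(desired, actual map[string]string) Plan {
	var p Plan
	for key, want := range desired {
		have, ok := actual[key]
		switch {
		case !ok:
			p.Create = append(p.Create, key) // missing in the cluster
		case have != want:
			p.Update = append(p.Update, key) // present but drifted
		}
	}
	for key := range actual {
		if _, ok := desired[key]; !ok {
			p.Delete = append(p.Delete, key) // no longer desired
		}
	}
	// Map iteration order is random; sort so the plan is stable across ticks.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Delete)
	return p
}

func main() {
	desired := map[string]string{"deploy/api": "h2", "svc/api": "h1", "deploy/worker": "h1"}
	actual := map[string]string{"deploy/api": "h1", "svc/api": "h1", "deploy/old": "h1"}
	fmt.Printf("%+v\n", diff(desired, actual))
}
```

Because the diff is computed from full state on every tick, running it again after a successful apply yields an empty plan, which is what makes the loop level-driven and safe to repeat.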

Snapshot Runnable implements manager.Runnable. Every 60 seconds, it walks the Informer cache, builds a list of snapshot records for all pods labeled starform.io/managed-by=shuttle, and POSTs the batch to Starbase. Each snapshot contains workspace_id (billing boundary), project_id, environment, service_id, pod_id, tier, phase, cluster_id, and snapshot_timestamp. A deterministic snapshot_id is computed as a hash of project_id + service_id + pod_id + snapshot_timestamp for idempotent ingestion. (The stale customer_id field was removed in v1.9; identity is now the project_id + environment + service_id tuple, with workspace_id as the billing label.)
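One way to compute the deterministic snapshot_id is sketched below. The text above specifies only which fields feed the hash; the choice of SHA-256, the hex encoding, the NUL separator, and the string form of the timestamp are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// snapshotID hashes the four identifying fields into a stable identifier.
// The same inputs always produce the same ID, so a replayed POST can be
// deduplicated on ingestion.
func snapshotID(projectID, serviceID, podID, snapshotTimestamp string) string {
	// A NUL separator (assumed not to occur in any field) keeps
	// ("ab", "c") distinct from ("a", "bc").
	joined := strings.Join([]string{projectID, serviceID, podID, snapshotTimestamp}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:])
}

func main() {
	// All values below are made up for illustration.
	fmt.Println(snapshotID("proj-1", "svc-1", "pod-abc", "2024-01-01T00:00:00Z"))
}
```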

Capacity Runnable also implements manager.Runnable. Every 60 seconds, it counts customer pods in the cache (filtered by label) and POSTs a capacity report to Starbase containing cluster_id, current_pod_count, soft_limit, and timestamp.
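The capacity report could be assembled roughly as follows. The JSON field names come from the paragraph above; the Pod stand-in type, the use of the starform.io/managed-by=shuttle label as the filter, and the string timestamp are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const managedByLabel = "starform.io/managed-by"

// Pod is a stand-in for the pod objects read from the Informer cache.
type Pod struct {
	Name   string
	Labels map[string]string
}

// CapacityReport is the payload POSTed to Starbase.
type CapacityReport struct {
	ClusterID       string `json:"cluster_id"`
	CurrentPodCount int    `json:"current_pod_count"`
	SoftLimit       int    `json:"soft_limit"`
	Timestamp       string `json:"timestamp"`
}

// buildCapacityReport counts pods carrying the managed-by label and wraps
// the count in a report. Pods without the label are ignored.
func buildCapacityReport(clusterID string, pods []Pod, softLimit int, timestamp string) CapacityReport {
	count := 0
	for _, p := range pods {
		if p.Labels[managedByLabel] == "shuttle" {
			count++
		}
	}
	return CapacityReport{
		ClusterID:       clusterID,
		CurrentPodCount: count,
		SoftLimit:       softLimit,
		Timestamp:       timestamp,
	}
}

func main() {
	// Illustrative values only.
	pods := []Pod{
		{Name: "api-1", Labels: map[string]string{managedByLabel: "shuttle"}},
		{Name: "kube-proxy"},
	}
	out, err := json.Marshal(buildCapacityReport("cluster-1", pods, 500, "2024-01-01T00:00:00Z"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```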

Starbase HTTP Client is a thin wrapper around net/http with:

  • Bearer token authentication (token loaded from a Kubernetes Secret at startup)
  • Retry with exponential backoff (3 attempts)
  • Timeout configuration (30s default)
  • Shared http.Client with connection pooling
  • JSON request/response encoding
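The client properties listed above can be sketched with net/http alone. The Client type, its field names, the doubling backoff schedule, the retry-on-5xx policy, and the /v1/snapshots path are assumptions for illustration; only the bearer token, 3 attempts, 30s timeout, shared http.Client, and JSON encoding come from the list.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Client is a hypothetical shape for the Starbase HTTP client.
type Client struct {
	BaseURL   string
	Token     string        // bearer token, loaded from a Secret at startup
	Attempts  int           // total attempts per request
	BaseDelay time.Duration // first backoff delay; doubles after each failure
	HTTP      *http.Client  // shared so connections are pooled
}

// PostJSON encodes payload as JSON and POSTs it, retrying on transport
// errors and 5xx responses. It returns the number of attempts made.
func (c *Client) PostJSON(ctx context.Context, path string, payload any) (int, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return 0, err
	}
	var lastErr error
	for attempt := 1; attempt <= c.Attempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.BaseURL+path, bytes.NewReader(body))
		if err != nil {
			return attempt, err
		}
		req.Header.Set("Authorization", "Bearer "+c.Token)
		req.Header.Set("Content-Type", "application/json")
		resp, err := c.HTTP.Do(req)
		if err == nil {
			// Drain and close so the pooled connection can be reused.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			if resp.StatusCode < 400 {
				return attempt, nil
			}
			err = fmt.Errorf("starbase returned %s", resp.Status)
			if resp.StatusCode < 500 {
				return attempt, err // assumption: 4xx is not retried
			}
		}
		lastErr = err
		if attempt == c.Attempts {
			break
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(c.BaseDelay << (attempt - 1)): // 1x, 2x, 4x, ...
		}
	}
	return c.Attempts, fmt.Errorf("giving up after %d attempts: %w", c.Attempts, lastErr)
}

// demo runs the client against a local test server that answers 503 twice
// before accepting the request.
func demo() (attempts int, auth string, err error) {
	var mu sync.Mutex
	calls, seen := 0, ""
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		calls++
		seen = r.Header.Get("Authorization")
		if calls < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()
	c := &Client{BaseURL: srv.URL, Token: "test-token", Attempts: 3,
		BaseDelay: 10 * time.Millisecond, HTTP: &http.Client{Timeout: 30 * time.Second}}
	attempts, err = c.PostJSON(context.Background(), "/v1/snapshots", map[string]int{"count": 1})
	mu.Lock()
	defer mu.Unlock()
	return attempts, seen, err
}

func main() {
	attempts, auth, err := demo()
	fmt.Println(attempts, auth, err)
}
```

Passing the Runnable's context into PostJSON means a shutdown interrupts a pending backoff wait instead of delaying the graceful stop.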

§19.4 What Lives Where

  • Informer cache — lives in Shuttle's process memory, managed by controller-runtime. Filtered by label selector starform.io/managed-by=shuttle. Rebuilt from scratch on restart.
  • Desired state — Starbase's responsibility. Shuttle never persists desired state locally — it pulls fresh on each tick.
  • Snapshot data — streamed to Starbase as produced. No local buffering for MVP. If a POST still fails after the HTTP client's 3 attempts, the batch is dropped (each dropped batch loses one 60-second snapshot interval for that cluster).

§19.5 Design Rationale

Why controller-runtime as a library, not a framework?

The standard Kubernetes operator pattern (Reconciler + For(...).Complete(r) + workqueue) is built for operators whose primary input is K8s API events. Shuttle's primary input is the Starbase HTTP API. There's no K8s object whose change should trigger a reconcile. Using manager.Runnable directly is cleaner: the ticker IS the trigger, the runnable's Start(ctx) IS the loop.

Why still use controller-runtime at all?

  1. The Manager provides production-grade lifecycle management: graceful shutdown, metrics endpoint, health probes, leader election when needed later.
  2. The Informer cache gives fast reads and automatic drift detection — same infrastructure Render uses.
  3. The cached client unifies reads from cache with writes to the API server.

Reference architecture: Plural's agentk (github.com/pluralsh/kubernetes-agent) — same pattern: lightweight agent in customer clusters, outbound connection to central control plane, pulls instructions, reports state. Plural uses bidirectional gRPC streaming; we can upgrade from HTTP polling post-MVP.


Cross-references

Cluster RBAC for the ServiceAccount → §22 · the resources the Desired State Runnable applies → §20 · the API the HTTP client speaks → §25 · the label set the Informer filters on → §24.1 · desired-state model → §32.