Document History

The living decision log (the successor to §41). Read it to understand why a decision was made before changing it. Versions track the legacy single-file PRD through v1.13, after which the docs were migrated to this MkDocs site. (Source: §41.)

Versioning after the migration

The single-file PRD's Version: header ended at v1.13. The documentation is now this docs/ site; new changes are recorded here (and in git history), not in a single file header. The legacy PRD is preserved at _legacy/master_prd.md.

Version	Date	Changes
1.0	April 2026	Initial architecture finalized. Shuttle PRD complete. Starbase architecture defined. Gateway decision (Envoy Gateway) locked. Naming system established.
1.1	April 2026	Added §15 (RBAC & Permissions Model) with workspace roles (Owner/Admin/Billing/Member), project roles (Admin/Developer/Viewer), environment protection flag, and database schema. Renumbered subsequent sections. Frontend named Stardeck (Ship's command deck).
1.2	April 2026	Rewrote §16 (Starforge): MVP adapter switched from Google Cloud Build + Nixpacks to Depot SaaS with Dockerfile frontend. Added decision rationale, per-customer project isolation model, log streaming to ClickHouse, known limitations, and migration triggers for future self-hosted BuildKit. Updated Tech Stack Summary, External Services, and ports detail table to reflect Depot as MVP build backend.
1.3	April 2026	Moved Railpack from post-MVP to primary build frontend at MVP, with Dockerfile as explicit fallback. Rationale: requiring a Dockerfile at launch was a visible UX regression vs Railway, Render, Heroku, Vercel. Railpack (MIT, Railway-maintained, BuildKit LLB-native) integrates as a Go library in Starbase Worker and submits LLB directly to Depot's BuildKitService. Adds ~2–3 weeks to MVP scope; removes the largest UX gap vs competitors. Updated §16.5, 16.8 limitations table, Tech Stack Summary, External Services row, adapter layout, and ports detail table.
1.4	April 2026	Corrected Depot pricing throughout §16. Previous versions stated the Depot Docker build rate as $0.004/min; the actual public rate is $0.04/min (10× higher). Build-minute sell price set to $0.05/min (a 20% gross margin on price, equivalent to a 25% markup over cost, at list rates). Added explicit note that Business-plan negotiation is required before launch to improve margins. Added Depot cache storage ($0.20/GB/mo) as tracked infrastructure overhead. Revised migration triggers with concrete Depot-spend thresholds. Financial Model v2.1 regenerated with matching corrections.
1.5	April 2026	Added §38 (Var Groups): Render-style environment groups as the primitive for env vars and secrets. Separate K8s Secret per Var Group with attach_order precedence, env-scoped via annotations, conflicts allowed with UI warning. Added paragraph to §32 pinning desired state computation code location (`internal/service/desiredstate.go`) and clarifying contract-vs-implementation split. Rewrote §39 (Not Yet Designed) into three priority tiers with ~35 items. Auth system and cluster bootstrap moved to post-MVP. Added MVP-blocking items: deployments & rollbacks, health checks, service types, volumes, wildcard domains, HPA, internal networking, live log streaming, build overrides, customer metrics dashboard. Added pre-launch-blocking items: audit logs, API tokens, status page/SLA, GDPR, rate limiting. Renumbered old §38 to 39 and old §39 to 40.
1.6	April 2026	Namespace model change. Switched from namespace-per-customer to namespace-per-project with environments as first-class labels (§20). Rationale: project-level cluster affinity (all environments of a project schedule to the same cluster), operational manageability at scale, and alignment with the scheduling-unit = isolation-unit principle. NetworkPolicies now use label selectors to enforce cross-environment isolation within a shared namespace. Labels expanded (§24): added `starform.io/environment`, `starform.io/workspace-id`, `starform.io/service-name`, `starform.io/service-type`, `starform.io/observability-enabled`, plus the standard K8s recommended label set (`app.kubernetes.io/`). New §24B (Annotations): comprehensive annotation catalog organized by resource type: provenance, rollout triggers (with Var Group checksum spec), DO CCM, Prometheus scrape, cert-manager, HTTPRoute, namespace-level. §38.7 updated with concrete checksum computation. §16 additions: §16.11 (Git integration: webhook signature verification, deduplication via `webhook_deliveries` table, 5-second push debouncing, GitHub App lifecycle events, stale installation detection, rate limiting), §16.12 (build lifecycle UX: state machine with cancellation, 4-step progress visibility, post-deploy credential display). §39 additions: managed object storage primitive (MVP), per-environment branch config, encryption-at-rest catalog, admission policy for label enforcement, project-to-cluster migration workflow, encryption key rotation. New §40 (Functional Requirements & Success Criteria): 62 FRs and 15 SCs in normative FR-###/SC-### format for test mapping and due diligence. Document History renumbered to 41.
1.7	April 2026	Label reclassification (§24). Split identity labels into two subsections: 24.1 Load-Bearing (8 labels required for correctness — system breaks if missing) and 24.2 Operational (3 labels including `tier`, honestly described as denormalized convenience for kubectl/dashboard/analytics — not billing-critical). Added explicit billing note: labels are not used for billing math; Starbase computes per-pod cost from PodSpec resource requests. The `tier` label was previously framed as billing-critical — that framing was wrong and has been corrected. Namespace Labels renumbered from 24.3 to 24.4. Object storage provider change. Switched from DO Spaces to Tigris via Partner Integration API. Rationale: Tigris has a purpose-built partner API for platforms (one API call creates an isolated tenant organization per customer), zero egress fees (strengthens Starform's egress-differentiation story), S3-compatible, globally distributed by default (no CDN integration needed). Updated §4.3 External Services, §6 Tech Stack Summary, §9 Package Layout (new `adapter/tigris/` folder, removed `storage.go` from DO adapter), §10 Ports Detail (StorageProvider description updated), §11 Adapter Wiring example, §37 BYOC table (with note that Tigris is cloud-agnostic and can remain in BYOC deployments), and §39.1 item 16 (MVP object storage primitive now references Tigris with explicit action item to negotiate partner pricing and SLA before launch). No pricing numbers added to PRD pending confirmed partner terms.
1.9	June 2026	Monitoring architecture finalized end-to-end. HTTPRoute naming pinned (§20.2): route name = `<project_uuid><service_uuid>-<environment>` with hyphen-stripped 32-char UUIDs and a positional parse, because Envoy's per-route metrics expose customer identity only via the cluster-name string and environments share a namespace. This supersedes the earlier "recover `project_id` via a kube-state-metrics namespace-label join" approach — folding all three IDs into the route name removes the cross-metric join and the kube-state-metrics dependency for Envoy attribution. Metrics pipeline rewritten (§35.2): three sources documented — Envoy Gateway (latency, RPS, throughput, error rate), kubelet/cAdvisor (CPU, memory, network), and kube-state-metrics — closing the CPU/memory collection gap. Three attribution mechanisms (cAdvisor joined on `(namespace, pod)` to `kube_pod_labels` for resource metrics — making kube-state-metrics load-bearing; `envoy_cluster_name` parsing for Envoy L7 metrics; direct pod-label promotion only for optional app `/metrics`), with the two customer-facing paths emitting identical label keys. Mandatory Envoy cardinality discipline added (stat-inclusion matcher, custom histogram buckets, no per-route metrics for preview envs). New §35.4 (Telemetry Transport): VPC peering ($0 egress), vmagent `remote_write` → vmauth, Vector agent → regional aggregator → ClickHouse, per-cluster bearer auth — the layer prior versions understated under "autonomous, zero shipping code." New §35.5 (Platform Self-Monitoring): self-hosted vmalert + Alertmanager against a separate internal VictoriaMetrics watching Starbase, Postgres, ClickHouse, VM, Envoy, LBs, and Shuttle; one vmagent routing to two stores by per-URL relabeling. 
Tenant-key reconciliation: removed the stale `customer_id` field throughout (§19.3, 25.1, 25.2, 26.2, 35.1, 35.2); canonical identity is `project_id` + `environment` + `service_id`, with `workspace_id` as the billing-boundary label — §24's catalog was already correct; the prose lagged it. Bootstrap (§26.3) + Cluster topology (§4.2): added metrics-server and kube-state-metrics; metrics-server is not bundled on DOKS and must be installed (HPA/`kubectl top` depend on it). HPA (§39.1 #9): MVP scales on CPU/memory via metrics-server only; RPS/latency scaling deferred (needs prometheus-adapter); clarified FR-050's display path is cAdvisor→VM, distinct from HPA's source. Retention (FR-051): split — logs via per-tier ClickHouse partition TTL; metrics global 90d on single-node VM at MVP, per-tier deferred. Distributed tracing confirmed out of MVP scope. Structure & readability: added Part 0 (Orientation) — system primer, end-to-end deploy walkthrough, a canonical-source map (one home per concept), and an MVP-scope-at-a-glance table. Softened the status line (was "Architecture Finalized" despite ~40 open items in §39). Synced the §2 diagram to the six-component customer cluster. BYOC reframed as out of scope (not offered) across §12, §17, §37, and §39 — content retained as a portability proof, not deleted; clarified the distinction from Starform-operated multi-cloud regions, which remain on the geographic-expansion roadmap. No content removed in this version. 
Consistency sweep: propagated the monitoring decisions into the summary tables and requirements — §4.1 (internal VictoriaMetrics, vmalert, Alertmanager, Vector aggregator added; customer-vs-internal store distinguished), the §2 infra diagram (platform-monitoring + aggregator row), §6 Tech Stack (metric sources, metrics-server, platform alerting, telemetry transport rows), and §40 (new FR-063–FR-069 for L7 metrics, tenant key, query isolation, private transport, Envoy cardinality, metrics-server, and platform self-monitoring; new SC-016–SC-018 for tenant isolation, alert latency, and metric freshness). Existing FR/SC numbering preserved (additions only). Identity propagation: carried the tenant-tuple decision through the desired-state contract (§32 — new "Identity in the payload" note; `environments` added to the computation SELECT), the §9 store layout (added `environment.go`, `var_group.go`, `bucket.go`), and the §39.1 #1 schema item (added the `environments` table and an explicit identity-model note: no `customers` table; canonical key `project_id` + `environment` + `service_id`; `workspace_id` as billing label).
1.10	June 2026	Monitoring topology revised for MVP; observability split into a self-contained SRE guide. Platform self-monitoring → Grafana Cloud (§35.5, FR-069): retired the self-hosted internal VictoriaMetrics + vmalert + Alertmanager for Grafana Cloud (hosted metrics + Grafana Alerting + OnCall) — an external watcher survives infra failure; MVP-only, revisit post-MVP (§39.3 #42). Net: per region VictoriaMetrics drops from two instances (customer + internal) to one (customer); ClickHouse unchanged. GDPR guardrail: customer identifiers excluded from platform series, or an EU-residency stack (§39.2 #22). Central control-plane region (§4.1): Starbase + Stardeck + Postgres run in one region at MVP; telemetry stores are per region. The only cross-region traffic is the desired-state pull and Starbase's dashboard reads, routed per project's region via a `region→store-endpoint` lookup (no fan-out, §35.4); single-region control plane is an accepted SPOF — workloads survive via Shuttle's level-driven loop (§39.3 #45). Telemetry stores on VMs (§4.1, §26.3): ClickHouse and VictoriaMetrics move from in-cluster to dedicated per-region DO VM droplets (separate droplet per store), provisioned out-of-band via Terraform/cloud-init — consistent with DO Managed Postgres/Valkey already off-cluster; backups via volume snapshots. Log pipeline → hybrid Fluent Bit → Vector (§35.1, §35.4): customer-cluster agent switched from Vector to Fluent Bit (~64Mi/node, lighter on tenant nodes, less kube-apiserver load at high namespace counts); the regional Vector aggregator keeps the native batching `clickhouse` sink (avoids "too many parts"), receiving via its `fluent` source. Arbitrary environment names (§20.2, §24.1, FR-070): no fixed dev/staging/prod enum — names are customer-chosen, validated as an RFC 1123 label (≤30 chars) because the name is load-bearing in the HTTPRoute parse and K8s labels. 
Preview/ephemeral environments are identified by a structural `is_ephemeral` flag, not name-matching — what the per-route preview-metric suppression keys off (FR-067, §39.1 #1). New open items (§39.3): self-hosted-monitoring migration (#42), cross-region query-proxy/cache (#43), multi-region aggregator placement (#44), multi-region control plane (#45). Companion doc: the Observability Architecture guide (`architecture/observability.html`) is rewritten as a self-contained SRE implementation guide — the SRE does not receive this PRD, so the guide inlines the observability-relevant contracts (route parse §20.2, label catalog §24, tenant key §24.1, query filter FR-065) with provenance tags ("owned by PRD §X; PRD wins on conflict") and a sync checklist, plus worked config examples per chapter. A scoped exception to the "companion docs never duplicate" rule (CLAUDE.md §6). Consistency sweep across §2 (topology diagram), §4 (control-plane region + regional telemetry tier + Fluent Bit customer cluster), §6 (tech stack), §40 (FR-067/069 amended, FR-070 added; SC-017 → Grafana).
1.11	June 2026	Cross-region read path, VPC/IP topology, and versionless rename. Control plane stays central (§4.1, §39.3 #45): one Starbase + one Postgres (Railway's model) — distributing it (Fly's model) needs a distributed DB / Corrosion-class build; out of near-term scope, SPOF accepted for MVP. Cross-region telemetry reads (§35.4, FR-071): the central control plane reaches each region's VPC-private stores over private cross-region VPC peering, through the existing authenticated front-doors — vmauth (VictoriaMetrics) + a read-only ClickHouse user — injecting the FR-065 filter itself. No new component; nothing internet-facing. The regional read-API/cache (§39.3 #43) is demoted to an optional post-MVP latency optimization. VPC & IP topology (§4.4): clusters in a region share one `/16` VPC (tenant isolation is namespace/NetworkPolicy, not VPC) → peerings track regions (~12 ≪ 50/account), not clusters. VPC-native DOKS (Cilium 1.31+) because Starform resells DO Managed DBs — customer pods reach their DB directly (DB sees the pod IP; isolated via Trusted Sources scoped to the pod subnet). IP plan: per-region node `/16` + per-cluster pod `/18` + service `/22` from a systematic 10/8 scheme, non-overlapping, in an IPAM registry (§39.3 #46) — not resizable after creation. DO viability stress-tested: every limit examined (peering 50, VPC /16, NIC bandwidth 2–25 Gbps, cluster/DB counts) is either huge headroom or a support-raised soft-quota (#47) — DO is a sound launch-and-scale foundation, with ports/adapters portability as the hedge. Rename: the master PRD is now versionless (`master_prd.md`); the header is the source of truth. Header → 1.11. Added FR-071; §39.3 #43 reframed, #45 updated, #46/#47 added.
1.12	June 2026	Platform self-monitoring → dedicated Grafana Alloy agent. §35.5, §4.2, §6, §26.3, FR-069: the platform plane no longer rides a vmagent `remote_write` split. A dedicated Grafana Alloy agent (Grafana's `k8s-monitoring` Helm chart) runs in each customer cluster, scrapes Starform-component series (Shuttle, Envoy gateway-self, node-exporter, KSM platform objects), and `remote_write`s them to Grafana Cloud — an independent path so a vmagent fault can't blind the SRE to platform health, and the standard Grafana Cloud Kubernetes onboarding. vmagent simplifies to a single destination (the regional customer VictoriaMetrics); the GDPR `labeldrop` of customer identifiers moves from vmagent's relabel rules to Alloy's. Net: the customer cluster goes six → seven components — the accepted trade-off (raised and committed) for an independent external watcher. Verified: Grafana Agent reached EOL Nov 2025; Alloy is its supported OpenTelemetry-collector successor. Companion doc (`architecture/observability.html`): rebuilt Diagrams 0/1/2/3 + bootstrap — added the Alloy box, split the read front-door into its two real parts (vmauth proxy deployment + a native read-only ClickHouse user; chproxy ruled out as unnecessary at MVP), and gave the metrics vs logs pipelines distinct arrow colours (blue = metrics, amber = logs). Header → 1.12.
1.13	June 2026	Fixed an Envoy stat-inclusion contradiction (§35.2). The cardinality matcher is a strict allowlist, so "keep only the three `upstream_` customer families" silently dropped Envoy's gateway-self metrics (`server.`, `control_plane.`, `listener_manager.`, `cluster_manager.`), the very metrics §35.2/§35.5 say must reach platform self-monitoring. §35.2 step 1 now allowlists both the per-route customer families and the bounded gateway-self families (per-instance, so no per-route cardinality cost; vmagent keeps the customer set → VictoriaMetrics, the Alloy agent keeps the gateway-self set → Grafana Cloud). Mirrored in `architecture/observability.html` ch.2 (EnvoyProxy example + step text + gotcha). Header → 1.13.
—	June 2026	Migrated to MkDocs. The single-file PRD + the hand-built observability HTML were converted into this component-first MkDocs Material site under `docs/` (the new source of truth); originals deprecated under `_legacy/`. Diagrams converted to theme-aware Mermaid; light + dark themes from the Starform frontend tokens; the `§`-cross-reference web preserved via stable `#section-N-M` anchors + the Canonical Sources index.
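
For readers tracing the 1.9 and 1.10 entries, the HTTPRoute naming rule (`<project_uuid><service_uuid>-<environment>`, hyphen-stripped 32-char UUIDs, positional parse) and the environment-name rule (RFC 1123 label, at most 30 chars) can be illustrated with a minimal sketch. This is not the Starbase implementation: the function names, the `RouteIdentity` type, and the regular expressions are hypothetical, chosen only to show why the parse is positional (an environment name may itself contain hyphens, so splitting on `-` would be ambiguous). How the route name is recovered from Envoy's cluster-name string is out of scope here.

```python
import re
import uuid
from typing import NamedTuple

# Hypothetical helpers illustrating the naming rule; not taken from the codebase.
HEX32 = re.compile(r"[0-9a-f]{32}")                      # hyphen-stripped UUID
ENV_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")  # RFC 1123 label shape
MAX_ENV_LEN = 30                                          # cap stated in the 1.10 entry


class RouteIdentity(NamedTuple):
    project_id: str
    service_id: str
    environment: str


def validate_environment(name: str) -> str:
    """Accept a customer-chosen environment name only if it is an
    RFC 1123 label of at most MAX_ENV_LEN characters."""
    if len(name) > MAX_ENV_LEN or not ENV_LABEL.fullmatch(name):
        raise ValueError(f"invalid environment name: {name!r}")
    return name


def build_route_name(project_id: uuid.UUID, service_id: uuid.UUID, environment: str) -> str:
    """Concatenate the two 32-char hex UUIDs, a hyphen, and the environment."""
    validate_environment(environment)
    return f"{project_id.hex}{service_id.hex}-{environment}"


def parse_route_name(name: str) -> RouteIdentity:
    """Positional parse: chars 0-31 project, 32-63 service, 64 is '-', rest is env."""
    if len(name) < 66 or name[64] != "-":
        raise ValueError(f"route name does not match the positional layout: {name!r}")
    project, service, environment = name[:32], name[32:64], name[65:]
    if not (HEX32.fullmatch(project) and HEX32.fullmatch(service)):
        raise ValueError(f"route name does not start with two 32-char hex IDs: {name!r}")
    validate_environment(environment)
    return RouteIdentity(project, service, environment)


# Example with made-up IDs and an environment name that contains hyphens.
project = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
service = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
route = build_route_name(project, service, "feature-login-v2")
identity = parse_route_name(route)
```

Because the two IDs have a fixed width, the environment is simply everything after position 64, which is why the name length and character set are constrained at creation time rather than at parse time.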

This is a living document. Pages will be expanded as components are designed and built.