Document History

The living decision log (the successor to §41). Read it to understand why a decision was made before changing it. Versions track the legacy single-file PRD through v1.13, after which the docs were migrated to this MkDocs site. (Source: §41.)

Versioning after the migration

The single-file PRD's Version: header ended at v1.13. The documentation is now this docs/ site; new changes are recorded here (and in git history), not in a single file header. The legacy PRD is preserved at _legacy/master_prd.md.

Version Date Changes
1.0 April 2026 Initial architecture finalized. Shuttle PRD complete. Starbase architecture defined. Gateway decision (Envoy Gateway) locked. Naming system established.
1.1 April 2026 Added §15 (RBAC & Permissions Model) with workspace roles (Owner/Admin/Billing/Member), project roles (Admin/Developer/Viewer), environment protection flag, and database schema. Renumbered subsequent sections. Frontend named Stardeck (Ship's command deck).
1.2 April 2026 Rewrote §16 (Starforge): MVP adapter switched from Google Cloud Build + Nixpacks to Depot SaaS with Dockerfile frontend. Added decision rationale, per-customer project isolation model, log streaming to ClickHouse, known limitations, and migration triggers for future self-hosted BuildKit. Updated Tech Stack Summary, External Services, and ports detail table to reflect Depot as MVP build backend.
1.3 April 2026 Moved Railpack from post-MVP to primary build frontend at MVP, with Dockerfile as explicit fallback. Rationale: requiring a Dockerfile at launch was a visible UX regression vs Railway, Render, Heroku, Vercel. Railpack (MIT, Railway-maintained, BuildKit LLB-native) integrates as a Go library in Starbase Worker and submits LLB directly to Depot's BuildKitService. Adds ~2–3 weeks to MVP scope; removes the largest UX gap vs competitors. Updated §16.5, 16.8 limitations table, Tech Stack Summary, External Services row, adapter layout, and ports detail table.
1.4 April 2026 Corrected Depot pricing throughout §16. Previous versions stated the Depot Docker build rate as $0.004/min; the actual public rate is $0.04/min (10× higher). Build-minute sell price set to $0.05/min (a 20% gross margin on the sell price, i.e. a 25% markup over cost, at list rates). Added explicit note that Business-plan negotiation is required before launch to improve margins. Added Depot cache storage ($0.20/GB/mo) as tracked infrastructure overhead. Revised migration triggers with concrete Depot-spend thresholds. Financial Model v2.1 regenerated with matching corrections.
1.5 April 2026 Added §38 (Var Groups): Render-style environment groups as the primitive for env vars and secrets. Separate K8s Secret per Var Group with attach_order precedence, env-scoped via annotations, conflicts allowed with UI warning. Added paragraph to §32 pinning desired state computation code location (internal/service/desiredstate.go) and clarifying contract-vs-implementation split. Rewrote §39 (Not Yet Designed) into three priority tiers with ~35 items. Auth system and cluster bootstrap moved to post-MVP. Added MVP-blocking items: deployments & rollbacks, health checks, service types, volumes, wildcard domains, HPA, internal networking, live log streaming, build overrides, customer metrics dashboard. Added pre-launch-blocking items: audit logs, API tokens, status page/SLA, GDPR, rate limiting. Renumbered old §38 to 39 and old §39 to 40.
1.6 April 2026 Namespace model change. Switched from namespace-per-customer to namespace-per-project with environments as first-class labels (§20). Rationale: project-level cluster affinity (all environments of a project schedule to the same cluster), operational manageability at scale, and alignment with scheduling-unit = isolation-unit principle. NetworkPolicies now use label selectors to enforce cross-environment isolation within a shared namespace. Labels expanded (§24): added starform.io/environment, starform.io/workspace-id, starform.io/service-name, starform.io/service-type, starform.io/observability-enabled, plus the standard K8s recommended label set (app.kubernetes.io/*). New §24B (Annotations): comprehensive annotation catalog organized by resource type — provenance, rollout triggers (with Var Group checksum spec), DO CCM, Prometheus scrape, cert-manager, HTTPRoute, namespace-level. §38.7 updated with concrete checksum computation. §16 additions: §16.11 (Git integration — webhook signature verification, deduplication via webhook_deliveries table, 5-second push debouncing, GitHub App lifecycle events, stale installation detection, rate limiting), §16.12 (build lifecycle UX — state machine with cancellation, 4-step progress visibility, post-deploy credential display). §39 additions: managed object storage primitive (MVP), per-environment branch config, encryption-at-rest catalog, admission policy for label enforcement, project-to-cluster migration workflow, encryption key rotation. New §40 (Functional Requirements & Success Criteria): 62 FRs and 15 SCs in normative FR-###/SC-### format for test mapping and due diligence. Document History renumbered to 41.
1.7 April 2026 Label reclassification (§24). Split identity labels into two subsections: 24.1 Load-Bearing (8 labels required for correctness — system breaks if missing) and 24.2 Operational (3 labels including tier, honestly described as denormalized convenience for kubectl/dashboard/analytics — not billing-critical). Added explicit billing note: labels are not used for billing math; Starbase computes per-pod cost from PodSpec resource requests. The tier label was previously framed as billing-critical — that framing was wrong and has been corrected. Namespace Labels renumbered from 24.3 to 24.4. Object storage provider change. Switched from DO Spaces to Tigris via Partner Integration API. Rationale: Tigris has a purpose-built partner API for platforms (one API call creates an isolated tenant organization per customer), zero egress fees (strengthens Starform's egress-differentiation story), S3-compatible, globally distributed by default (no CDN integration needed). Updated §4.3 External Services, §6 Tech Stack Summary, §9 Package Layout (new adapter/tigris/ folder, removed storage.go from DO adapter), §10 Ports Detail (StorageProvider description updated), §11 Adapter Wiring example, §37 BYOC table (with note that Tigris is cloud-agnostic and can remain in BYOC deployments), and §39.1 item 16 (MVP object storage primitive now references Tigris with explicit action item to negotiate partner pricing and SLA before launch). No pricing numbers added to PRD pending confirmed partner terms.
1.9 June 2026 Monitoring architecture finalized end-to-end. HTTPRoute naming pinned (§20.2): route name = <project_uuid><service_uuid>-<environment> with hyphen-stripped 32-char UUIDs and a positional parse, because Envoy's per-route metrics expose customer identity only via the cluster-name string and environments share a namespace. This supersedes the earlier "recover project_id via a kube-state-metrics namespace-label join" approach — folding all three IDs into the route name removes the cross-metric join and the kube-state-metrics dependency for Envoy attribution. Metrics pipeline rewritten (§35.2): three sources documented — Envoy Gateway (latency, RPS, throughput, error rate), kubelet/cAdvisor (CPU, memory, network), and kube-state-metrics — closing the CPU/memory collection gap. Three attribution mechanisms (cAdvisor joined on (namespace, pod) to kube_pod_labels for resource metrics — making kube-state-metrics load-bearing; envoy_cluster_name parsing for Envoy L7 metrics; direct pod-label promotion only for optional app /metrics), with the two customer-facing paths emitting identical label keys. Mandatory Envoy cardinality discipline added (stat-inclusion matcher, custom histogram buckets, no per-route metrics for preview envs). New §35.4 (Telemetry Transport): VPC peering ($0 egress), vmagent remote_write → vmauth, Vector agent → regional aggregator → ClickHouse, per-cluster bearer auth — the layer prior versions understated under "autonomous, zero shipping code." New §35.5 (Platform Self-Monitoring): self-hosted vmalert + Alertmanager against a separate internal VictoriaMetrics watching Starbase, Postgres, ClickHouse, VM, Envoy, LBs, and Shuttle; one vmagent routing to two stores by per-URL relabeling. 
Tenant-key reconciliation: removed the stale customer_id field throughout (§19.3, 25.1, 25.2, 26.2, 35.1, 35.2); canonical identity is project_id + environment + service_id, with workspace_id as the billing-boundary label — §24's catalog was already correct; the prose lagged it. Bootstrap (§26.3) + Cluster topology (§4.2): added metrics-server and kube-state-metrics; metrics-server is not bundled on DOKS and must be installed (HPA/kubectl top depend on it). HPA (§39.1 #9): MVP scales on CPU/memory via metrics-server only; RPS/latency scaling deferred (needs prometheus-adapter); clarified FR-050's display path is cAdvisor→VM, distinct from HPA's source. Retention (FR-051): split — logs via per-tier ClickHouse partition TTL; metrics global 90d on single-node VM at MVP, per-tier deferred. Distributed tracing confirmed out of MVP scope. Structure & readability: added Part 0 (Orientation) — system primer, end-to-end deploy walkthrough, a canonical-source map (one home per concept), and an MVP-scope-at-a-glance table. Softened the status line (was "Architecture Finalized" despite ~40 open items in §39). Synced the §2 diagram to the six-component customer cluster. BYOC reframed as out of scope (not offered) across §12, §17, §37, and §39 — content retained as a portability proof, not deleted; clarified the distinction from Starform-operated multi-cloud regions, which remain on the geographic-expansion roadmap. No content removed in this version. 
Consistency sweep: propagated the monitoring decisions into the summary tables and requirements — §4.1 (internal VictoriaMetrics, vmalert, Alertmanager, Vector aggregator added; customer-vs-internal store distinguished), the §2 infra diagram (platform-monitoring + aggregator row), §6 Tech Stack (metric sources, metrics-server, platform alerting, telemetry transport rows), and §40 (new FR-063–FR-069 for L7 metrics, tenant key, query isolation, private transport, Envoy cardinality, metrics-server, and platform self-monitoring; new SC-016–SC-018 for tenant isolation, alert latency, and metric freshness). Existing FR/SC numbering preserved (additions only). Identity propagation: carried the tenant-tuple decision through the desired-state contract (§32 — new "Identity in the payload" note; environments added to the computation SELECT), the §9 store layout (added environment.go, var_group.go, bucket.go), and the §39.1 #1 schema item (added the environments table and an explicit identity-model note: no customers table; canonical key project_id + environment + service_id; workspace_id as billing label).
1.10 June 2026 Monitoring topology revised for MVP; observability split into a self-contained SRE guide. Platform self-monitoring → Grafana Cloud (§35.5, FR-069): retired the self-hosted internal VictoriaMetrics + vmalert + Alertmanager for Grafana Cloud (hosted metrics + Grafana Alerting + OnCall) — an external watcher survives infra failure; MVP-only, revisit post-MVP (§39.3 #42). Net: per region VictoriaMetrics drops from two instances (customer + internal) to one (customer); ClickHouse unchanged. GDPR guardrail: customer identifiers excluded from platform series, or an EU-residency stack (§39.2 #22). Central control-plane region (§4.1): Starbase + Stardeck + Postgres run in one region at MVP; telemetry stores are per region. The only cross-region traffic is the desired-state pull and Starbase's dashboard reads, routed per project's region via a region→store-endpoint lookup (no fan-out, §35.4); single-region control plane is an accepted SPOF — workloads survive via Shuttle's level-driven loop (§39.3 #45). Telemetry stores on VMs (§4.1, §26.3): ClickHouse and VictoriaMetrics move from in-cluster to dedicated per-region DO VM droplets (separate droplet per store), provisioned out-of-band via Terraform/cloud-init — consistent with DO Managed Postgres/Valkey already off-cluster; backups via volume snapshots. Log pipeline → hybrid Fluent Bit → Vector (§35.1, §35.4): customer-cluster agent switched from Vector to Fluent Bit (~64Mi/node, lighter on tenant nodes, less kube-apiserver load at high namespace counts); the regional Vector aggregator keeps the native batching clickhouse sink (avoids "too many parts"), receiving via its fluent source. Arbitrary environment names (§20.2, §24.1, FR-070): no fixed dev/staging/prod enum — names are customer-chosen, validated as an RFC 1123 label (≤30 chars) because the name is load-bearing in the HTTPRoute parse and K8s labels. 
Preview/ephemeral environments are identified by a structural is_ephemeral flag, not name-matching — what the per-route preview-metric suppression keys off (FR-067, §39.1 #1). New open items (§39.3): self-hosted-monitoring migration (#42), cross-region query-proxy/cache (#43), multi-region aggregator placement (#44), multi-region control plane (#45). Companion doc: the Observability Architecture guide (architecture/observability.html) is rewritten as a self-contained SRE implementation guide — the SRE does not receive this PRD, so the guide inlines the observability-relevant contracts (route parse §20.2, label catalog §24, tenant key §24.1, query filter FR-065) with provenance tags ("owned by PRD §X; PRD wins on conflict") and a sync checklist, plus worked config examples per chapter. A scoped exception to the "companion docs never duplicate" rule (CLAUDE.md §6). Consistency sweep across §2 (topology diagram), §4 (control-plane region + regional telemetry tier + Fluent Bit customer cluster), §6 (tech stack), §40 (FR-067/069 amended, FR-070 added; SC-017 → Grafana).
1.11 June 2026 Cross-region read path, VPC/IP topology, and versionless rename. Control plane stays central (§4.1, §39.3 #45): one Starbase + one Postgres (Railway's model). Distributing it (Fly's model) needs a distributed DB / Corrosion-class build; out of near-term scope, SPOF accepted for MVP. Cross-region telemetry reads (§35.4, FR-071): the central control plane reaches each region's VPC-private stores over private cross-region VPC peering, through the existing authenticated front-doors (vmauth for VictoriaMetrics, plus a read-only ClickHouse user), injecting the FR-065 filter itself. No new component; nothing internet-facing. The regional read-API/cache (§39.3 #43) is demoted to an optional post-MVP latency optimization. VPC & IP topology (§4.4): clusters in a region share one /16 VPC (tenant isolation is namespace/NetworkPolicy, not VPC), so peerings track regions (~12 ≪ 50/account), not clusters. VPC-native DOKS (Cilium-based, Kubernetes 1.31+) because Starform resells DO Managed DBs: customer pods reach their DB directly (DB sees the pod IP; isolated via Trusted Sources scoped to the pod subnet). IP plan: per-region node /16 + per-cluster pod /18 + service /22 from a systematic 10/8 scheme, non-overlapping, tracked in an IPAM registry (§39.3 #46); ranges are not resizable after creation. DO viability stress-tested: every limit examined (peering 50, VPC /16, NIC bandwidth 2–25 Gbps, cluster/DB counts) either has ample headroom or is a soft quota that support can raise (#47). DO is a sound launch-and-scale foundation, with ports/adapters portability as the hedge. Rename: the master PRD filename is now versionless (master_prd.md); the Version: header is the source of truth. Header → 1.11. Added FR-071; §39.3 #43 reframed, #45 updated, #46/#47 added.
1.12 June 2026 Platform self-monitoring → dedicated Grafana Alloy agent. §35.5, §4.2, §6, §26.3, FR-069: the platform plane no longer rides a vmagent remote_write split. A dedicated Grafana Alloy agent (Grafana's k8s-monitoring Helm chart) runs in each customer cluster, scrapes Starform-component series (Shuttle, Envoy gateway-self, node-exporter, KSM platform objects), and remote_writes them to Grafana Cloud — an independent path so a vmagent fault can't blind the SRE to platform health, and the standard Grafana Cloud Kubernetes onboarding. vmagent simplifies to a single destination (the regional customer VictoriaMetrics); the GDPR labeldrop of customer identifiers moves from vmagent's relabel rules to Alloy's. Net: the customer cluster goes six → seven components — the accepted trade-off (raised and committed) for an independent external watcher. Verified: Grafana Agent reached EOL Nov 2025; Alloy is its supported OpenTelemetry-collector successor. Companion doc (architecture/observability.html): rebuilt Diagrams 0/1/2/3 + bootstrap — added the Alloy box, split the read front-door into its two real parts (vmauth proxy deployment + a native read-only ClickHouse user; chproxy ruled out as unnecessary at MVP), and gave the metrics vs logs pipelines distinct arrow colours (blue = metrics, amber = logs). Header → 1.12.
1.13 June 2026 Fixed an Envoy stat-inclusion contradiction (§35.2). The cardinality matcher is a strict allowlist, so "keep only the three upstream_* customer families" silently dropped Envoy's gateway-self metrics (server.*, control_plane.*, listener_manager.*, cluster_manager.*) — the very metrics §35.2/§35.5 say must reach platform self-monitoring. §35.2 step 1 now allowlists both the per-route customer families and the bounded gateway-self families (per-instance, so no per-route cardinality cost; vmagent keeps the customer set → VictoriaMetrics, the Alloy agent keeps the gateway-self set → Grafana Cloud). Mirrored in architecture/observability.html ch.2 (EnvoyProxy example + step text + gotcha). Header → 1.13.
June 2026 Migrated to MkDocs. The single-file PRD + the hand-built observability HTML were converted into this component-first MkDocs Material site under docs/ (the new source of truth); originals deprecated under _legacy/. Diagrams converted to theme-aware Mermaid; light + dark themes from the Starform frontend tokens; the §-cross-reference web preserved via stable #section-N-M anchors + the Canonical Sources index.
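
The route-name convention pinned in v1.9 (`<project_uuid><service_uuid>-<environment>`, hyphen-stripped 32-char UUIDs, positional parse) and the v1.10 environment-name rule (RFC 1123 label, ≤30 chars) can be sketched together in Go. This is a minimal illustration, not the Starbase implementation: `RouteIdentity`, `BuildRouteName`, `ParseRouteName`, and `ValidEnvName` are hypothetical names, and extracting the route name from Envoy's cluster-name string is left out because the log does not give that format.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// uuidHexLen is the length of a UUID once its four hyphens are stripped.
const uuidHexLen = 32

// maxEnvLen is the environment-name cap stated in the v1.10 entry.
const maxEnvLen = 30

// envNameRE matches an RFC 1123 label: lowercase alphanumerics and hyphens,
// starting and ending with an alphanumeric.
var envNameRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// hexRE matches one hyphen-stripped, lowercase UUID.
var hexRE = regexp.MustCompile(`^[0-9a-f]{32}$`)

// RouteIdentity is a hypothetical holder for the tenant tuple
// (project_id + service_id + environment) recovered from a route name.
type RouteIdentity struct {
	ProjectID   string
	ServiceID   string
	Environment string
}

// ValidEnvName reports whether env is an RFC 1123 label of at most 30 chars.
func ValidEnvName(env string) bool {
	return len(env) <= maxEnvLen && envNameRE.MatchString(env)
}

// stripUUID lowercases a UUID and removes its hyphens.
func stripUUID(u string) string {
	return strings.ToLower(strings.ReplaceAll(u, "-", ""))
}

// BuildRouteName folds the three identifiers into one route name.
func BuildRouteName(projectUUID, serviceUUID, env string) (string, error) {
	p, s := stripUUID(projectUUID), stripUUID(serviceUUID)
	if !hexRE.MatchString(p) || !hexRE.MatchString(s) {
		return "", errors.New("project and service IDs must be UUIDs")
	}
	if !ValidEnvName(env) {
		return "", fmt.Errorf("invalid environment name %q", env)
	}
	return p + s + "-" + env, nil
}

// ParseRouteName splits by position rather than on hyphens, because the
// environment name may itself contain hyphens.
func ParseRouteName(name string) (RouteIdentity, error) {
	const sep = 2 * uuidHexLen // index of the hyphen before the environment
	if len(name) <= sep+1 || name[sep] != '-' {
		return RouteIdentity{}, errors.New("route name is not <project><service>-<environment>")
	}
	id := RouteIdentity{
		ProjectID:   name[:uuidHexLen],
		ServiceID:   name[uuidHexLen:sep],
		Environment: name[sep+1:],
	}
	if !hexRE.MatchString(id.ProjectID) || !hexRE.MatchString(id.ServiceID) || !ValidEnvName(id.Environment) {
		return RouteIdentity{}, errors.New("route name has malformed IDs or environment")
	}
	return id, nil
}

func main() {
	// The UUIDs and environment name below are made-up sample values.
	name, err := BuildRouteName(
		"3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b",
		"0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d",
		"feature-login-v2",
	)
	if err != nil {
		panic(err)
	}
	id, err := ParseRouteName(name)
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
	fmt.Printf("%+v\n", id)
}
```

The fixed offsets are what make a customer-chosen name such as `feature-login-v2` safe to embed: the parser never has to guess which hyphen ends the IDs. At the stated maximums the name is 32 + 32 + 1 + 30 = 95 characters.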

This is a living document. Pages will be expanded as components are designed and built.
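
The v1.11 IP plan (per-region node /16, per-cluster pod /18, service /22, all non-overlapping inside 10/8 and not resizable after creation) implies a check that the IPAM registry would run before any range is handed out. The sketch below shows one way to express that check with Go's standard `net/netip` package. The `Allocation` type, the `Validate` and `mustAlloc` helpers, the owner strings, and every concrete CIDR are assumptions for illustration; the log does not specify the actual allocation scheme.

```go
package main

import (
	"fmt"
	"net/netip"
)

// planRoot is the 10/8 block the v1.11 entry says all ranges come from.
var planRoot = netip.MustParsePrefix("10.0.0.0/8")

// Allocation is a hypothetical IPAM registry row.
type Allocation struct {
	Owner  string // illustrative owner key, e.g. "region-1/cluster-a/pods"
	Prefix netip.Prefix
}

// mustAlloc builds an Allocation from a CIDR string and panics on bad input.
func mustAlloc(owner, cidr string) Allocation {
	return Allocation{Owner: owner, Prefix: netip.MustParsePrefix(cidr)}
}

// Validate returns the first problem found, or nil when every range is
// canonical, inside planRoot, and disjoint from every other range.
func Validate(allocs []Allocation) error {
	for i, a := range allocs {
		if !a.Prefix.IsValid() || a.Prefix != a.Prefix.Masked() {
			return fmt.Errorf("%s: %v is not a canonical prefix", a.Owner, a.Prefix)
		}
		if a.Prefix.Bits() < planRoot.Bits() || !planRoot.Contains(a.Prefix.Addr()) {
			return fmt.Errorf("%s: %v is outside %v", a.Owner, a.Prefix, planRoot)
		}
		// Compare against every earlier row; Overlaps is symmetric.
		for _, b := range allocs[:i] {
			if a.Prefix.Overlaps(b.Prefix) {
				return fmt.Errorf("%s (%v) overlaps %s (%v)", a.Owner, a.Prefix, b.Owner, b.Prefix)
			}
		}
	}
	return nil
}

func main() {
	// Made-up sample registry: one region, two clusters.
	registry := []Allocation{
		mustAlloc("region-1/nodes", "10.0.0.0/16"),
		mustAlloc("region-1/cluster-a/pods", "10.1.0.0/18"),
		mustAlloc("region-1/cluster-a/services", "10.2.0.0/22"),
		mustAlloc("region-1/cluster-b/pods", "10.1.64.0/18"),
		mustAlloc("region-1/cluster-b/services", "10.2.4.0/22"),
	}
	fmt.Println("registry valid:", Validate(registry) == nil)

	// A third cluster that reuses cluster-a's pod range must be rejected.
	clash := append(registry, mustAlloc("region-1/cluster-c/pods", "10.1.0.0/18"))
	fmt.Println("clash detected:", Validate(clash) != nil)
}
```

Because the ranges cannot be resized later, rejecting an overlap at registration time is far cheaper than discovering it after a cluster exists.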