Not Yet Designed

The following require detailed design before or during implementation. Priority tiers reflect MVP blocking, launch blocking, and post-launch scope. (Source: §39.) When a design touches one of these, surface it.

§39.1 MVP-Blocking — Must Design Before Building Starbase

Database schema — Postgres tables for users, workspaces, projects, environments, services, deployments, builds, clusters, billing, regions, var_groups, var_group_entries, service_var_groups, github_installations, webhook_deliveries (for dedup), buckets (object storage primitive). RBAC tables (users, workspaces, workspace_members, projects, project_members, environments) defined in §15.7. Identity model (v1.9): there is no customers table — the entity hierarchy is workspace → project → environment → service. The canonical tenant key carried through desired state, snapshots, metrics, and logs is project_id + environment + service_id, with workspace_id as the billing-boundary label; the stale customer_id field was removed throughout. Desired state serialization code (internal/service/desiredstate.go) is downstream of this work. Environments are customer-named, validated as an RFC 1123 label (≤30 chars, §20.2); the environments table carries an is_ephemeral flag so preview/ephemeral environments are identified structurally, not by name-matching — this is what the per-route preview-metric suppression keys off (FR-067).
Starbase API endpoints — Full REST API for the dashboard, with detailed request/response contracts for all resources (services, deployments, var groups, custom domains, buckets, databases, etc.)
Customer onboarding flow — Signup → create workspace → create project → connect Git → first deploy (end-to-end happy path)
Deployments & rollbacks — Deployment versioning model, rollback UX (one-click to previous version), zero-downtime rolling deploy semantics, deployment history retention, surge limits, build state machine (specified in §16.12)
Health checks — Customer-configurable startup/liveness/readiness probes (HTTP path, TCP, gRPC); how Shuttle/Gateway wait for readiness before shifting traffic
Service types — Distinguish web services (HTTP-serving with HTTPRoute), background workers (no ingress), cron jobs (K8s CronJob resource, not Deployment). Each renders differently in Shuttle.
Volumes / persistent storage — DO Block Storage CSI driver configuration, volume attach/resize/snapshot workflows, survives pod restart/node failure
Wildcard / auto-generated domains — DNS zone for *.starform.app or equivalent, TLS cert issuance for wildcards, routing to the right pod, subdomain conflict handling
Horizontal autoscaling (HPA) — Per-service scaling rules, min/max replica limits, scale-up/scale-down cooldown. MVP scope: CPU/memory only, driven by the Kubernetes Metrics API (metrics-server — confirmed not bundled on DOKS, installed at bootstrap per §26.3). Scaling on RPS/latency is deferred — custom-metric HPA requires a prometheus-adapter (or KEDA) exposing the regional VictoriaMetrics through the custom-metrics API, which has no path today. Note that FR-050's customer-facing CPU/memory display comes from the cAdvisor → vmagent → VictoriaMetrics path (§35.2) and is separate from HPA's metrics-server source — do not conflate the two. Separate from cluster autoscaling (already designed) and scale-to-zero (post-MVP).
Internal / private networking — Service-to-service DNS pattern, public-vs-private service flag, cross-environment network isolation (§20.4), NetworkPolicy enforcement
Live log streaming UX — WebSocket or SSE streaming from ClickHouse to dashboard, CLI logs -f support, filtering by level/service/time, search
Build overrides — Custom build command, pre-deploy command (e.g., DB migrations), start command — how users override Railpack defaults
Customer-facing metrics dashboard — Default dashboards per service (request rate, error rate, latency, CPU, memory), ability to build custom dashboards
Stripe integration detail — Invoice lifecycle, payment failure handling, dunning flow, refunds, proration edge cases
Build pipeline error handling — Build timeouts, Dockerfile/Railpack build failures, image size limits, retry semantics
Managed object storage primitive (Tigris) — S3-compatible buckets provisioned per environment via the Tigris Partner Integration API. One Tigris Organization per Starform workspace (isolated tenants). Credential injection into services via an auto-generated system Secret (endpoint, access key ID, secret access key). Public vs. private visibility flag. Credential display in dashboard after provisioning. Zero egress fees pass through to the customer. Pre-launch action: negotiate partner pricing and SLA with Tigris sales (help@tigrisdata.com); do not assume pricing or unit economics until terms are confirmed.
Per-environment branch configuration — Explicit mapping from Git branch to environment (e.g., main → production, develop → staging). Default auto-deploy on push to configured branch; opt-out flag per service-config.
Encryption-at-rest catalog — Cross-cutting spec covering all encrypted fields in Postgres: Var Group entry values (§38.2), DB credentials, bucket credentials, GitHub installation tokens, and cloud-provider credentials/kubeconfigs Starform holds for its own clusters and future multi-cloud regions (§12). All use AES-256-GCM with a single key from ENCRYPTION_KEY env var; key rotation is a post-MVP batch job.
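
The encryption-at-rest item above can be illustrated with a minimal Go sketch of field-level AES-256-GCM using only the standard library. Several details here are assumptions rather than decided design: that ENCRYPTION_KEY holds a base64-encoded 32-byte key, that the stored format is nonce followed by ciphertext and tag, and that a column label is bound as additional authenticated data (AAD). The function names (loadKey, encryptField, decryptField) are hypothetical.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
	"os"
)

// loadKey reads the key from ENCRYPTION_KEY.
// Assumption: the value is base64-encoded; the spec only names the env var.
func loadKey() ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(os.Getenv("ENCRYPTION_KEY"))
	if err != nil {
		return nil, fmt.Errorf("decode ENCRYPTION_KEY: %w", err)
	}
	if len(key) != 32 {
		return nil, errors.New("ENCRYPTION_KEY must decode to 32 bytes for AES-256")
	}
	return key, nil
}

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptField returns nonce || ciphertext || tag, using a fresh random nonce.
func encryptField(key, plaintext, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext and tag to the nonce slice.
	return gcm.Seal(nonce, nonce, plaintext, aad), nil
}

// decryptField reverses encryptField and fails if the data or AAD was altered.
func decryptField(key, sealed, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, aad)
}

func main() {
	key, err := loadKey()
	if err != nil {
		// Demo fallback only: a throwaway random key so the sketch runs without setup.
		key = make([]byte, 32)
		if _, err := rand.Read(key); err != nil {
			panic(err)
		}
	}
	aad := []byte("var_group_entries.value") // hypothetical column label bound as AAD
	secret := "postgres://user:pass@host/db"
	sealed, err := encryptField(key, []byte(secret), aad)
	if err != nil {
		panic(err)
	}
	plain, err := decryptField(key, sealed, aad)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sealed %d bytes, round trip ok: %v\n", len(sealed), string(plain) == secret)
}
```

Binding a per-column label as AAD is one way to stop a ciphertext from being copied between fields undetected. Because a single key covers every field, the post-MVP rotation job would need to decrypt and re-encrypt each stored value.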

§39.2 Pre-Launch-Blocking — Must Exist Before Public Launch

These gate public launch (SOC 2 / enterprise expectations), even though they're not needed to build the MVP.

Audit logs — Who did what and when, across all customer-facing operations. Required for SOC 2 and expected by enterprise customers
API tokens / service accounts — Scoped tokens for CI/CD integration (deploy from GitHub Actions without user credentials), per-machine tokens
Status page / SLA / incident response — Public status page (status.starform.io), incident response runbooks, SLA commitments for uptime
GDPR / account deletion / data retention — Right-to-be-forgotten workflow, account closure data cleanup, data export before deletion. Includes confirming Grafana Cloud platform-telemetry data residency — either an EU-residency stack or enforced exclusion of customer identifiers from the platform series (§35.5).
Rate limiting — Per-customer API quotas (target 100 req/min/user), deploy rate limits, build queue limits, and protection against the impact of a runaway customer. Redis-backed token bucket implementation.
Admission policy for label enforcement — Kyverno or custom admission webhook rejecting pods missing required starform.io/* labels. Prevents silent NetworkPolicy bypass (§20.4 failure mode).
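
The token bucket named in the rate-limiting item above can be sketched in Go as follows. This is an in-memory illustration of the algorithm only, not the Redis-backed design. The Limiter type, the AllowAt method, and the choice to pass time as Unix milliseconds (as a Redis script would typically receive it) are all assumptions. The 100 requests per minute figure comes from the stated target.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// bucket holds the remaining tokens for one user and the last refill time.
type bucket struct {
	tokens float64
	lastMs int64
}

// Limiter is an in-memory token bucket keyed by user ID.
// Assumption: the real implementation keeps this state in Redis.
type Limiter struct {
	mu          sync.Mutex
	capacity    float64 // maximum burst, in tokens
	refillPerMs float64 // tokens added per millisecond
	buckets     map[string]*bucket
}

// NewLimiter allows perMinute requests per minute, with a burst of the same size.
func NewLimiter(perMinute int) *Limiter {
	return &Limiter{
		capacity:    float64(perMinute),
		refillPerMs: float64(perMinute) / 60000.0,
		buckets:     make(map[string]*bucket),
	}
}

// AllowAt reports whether userID may make a request at nowMs (Unix milliseconds).
// It refills the bucket for the elapsed time, then spends one token if available.
func (l *Limiter) AllowAt(userID string, nowMs int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	b, ok := l.buckets[userID]
	if !ok {
		// New users start with a full bucket.
		b = &bucket{tokens: l.capacity, lastMs: nowMs}
		l.buckets[userID] = b
	}
	if elapsed := nowMs - b.lastMs; elapsed > 0 {
		b.tokens = math.Min(l.capacity, b.tokens+float64(elapsed)*l.refillPerMs)
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	lim := NewLimiter(100) // target from the spec: 100 req/min/user
	now := time.Now().UnixMilli()
	allowed := 0
	for i := 0; i < 150; i++ {
		if lim.AllowAt("user-123", now) { // "user-123" is a placeholder ID
			allowed++
		}
	}
	fmt.Printf("allowed %d of 150 requests in one burst\n", allowed)
}
```

A Redis-backed version would need the refill and spend steps to run atomically, for example in a single server-side script, so that concurrent API replicas cannot both spend the same token.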

§39.3 Post-MVP / Seed-Stage

Deferred beyond MVP. One decision is still pending: whether preview environments (#28) ship at launch (Railway parity) or are deferred to seed stage.

Auth system — Session management, JWT issuance for dashboard (15-min access + rotatable refresh tokens), SSO callback handling, password reset, MFA
Cluster bootstrap workflow — Step-by-step: DOKS creation → Envoy Gateway → metrics-server + kube-state-metrics → Fluent Bit + vmagent + Grafana Alloy → Shuttle → DNS → active. Plus the out-of-band per-region provisioning (Terraform/cloud-init) of the ClickHouse + VictoriaMetrics VM droplets and the Vector aggregator (§4.1, §26.3)
Custom domains — DNS validation, Cloudflare API integration, HTTPRoute lifecycle, cert issuance via cert-manager
Preview environments — Ephemeral environments per PR, full replica of production with Git-branch-based lifecycle, flagged is_ephemeral on the environments table (§39.1 #1). Open: the per-route Envoy L7 metric handling for previews (short retention vs pre-aggregation vs suppression — FR-067), to be architected in the Observability Architecture doc. Strategic decision: ship at launch (Railway parity) or defer to seed stage
Scale-to-zero — Free tier only, HPAScaleToZero (K8s 1.36), custom activator for request buffering
Auth primitive — JWKS endpoint hosting, SecurityPolicy lifecycle, claim-to-header mapping UI
Email primitive — AWS SES integration, per-customer domain management, bounce handling
Queue primitive — NATS as customer-facing managed service
CLI tool — starform deploy, starform logs, starform status, device code auth flow
Platform-side disaster recovery — Control plane Postgres backup/restore, registry compromise response, runbooks
Multi-region failover — Region drain workflow, customer migration between regions, SLA commitments
Platform upgrade procedures — Rolling out new Shuttle versions across cluster fleet, K8s version upgrades, Envoy Gateway upgrades
Capacity planning & quota enforcement — Cluster-level quota (1000 pods soft limit), auto-provision next cluster at threshold, signup throttling
External secret store integration — AWS Secrets Manager, HashiCorp Vault as customer-facing secret-store backends (lets a customer source Var Group values from their own vault). Post-MVP. (Not BYOC, which is out of scope, §37.)
Per-environment value overrides in Var Groups — currently requires separate groups per environment
Project-to-cluster migration workflow — Moving a project between clusters (e.g., region change) requires coordinated source drain + target provision; not in MVP scope
Encryption key rotation — Batch job to re-encrypt all encrypted fields with new key without downtime
Self-hosted platform monitoring — migrate platform self-monitoring off Grafana Cloud (MVP-only, §35.5) to a self-hosted stack once the SRE team can operate it; re-evaluate cost vs. the external-watcher benefit
Cross-region telemetry read-API / cache (optional) — the base design has Starbase read regional stores directly over private peering (§35.4). If out-of-region read latency or volume grows, an optional thin regional read-API/cache (private, co-located with the stores) removes the cross-region RTT. Not MVP-blocking, and not required by the base design
Multi-region Vector aggregator placement — at MVP (single region) the regional Vector aggregator sits in the infra footprint; for added regions, decide whether each region's aggregator runs on a VM droplet (alongside its stores) or in a small regional platform cluster
Multi-region control plane — the control plane is central by decision (Railway's model; one Starbase + one Postgres) and a single-region SPOF at MVP (§4.1). Running workloads survive an outage via Shuttle's level-driven loop (§32), but new deploys + dashboards stop. Distributing it (Fly's model) needs a distributed/replicated DB — out of near-term scope. Design control-plane HA before relying on multi-region
IPAM / subnet allocation registry — VPC-native DOKS needs non-overlapping pod/service subnets per cluster plus per-region /16 VPCs, all from a systematic 10.0.0.0/8 plan that can't be resized after creation (§4.4). Maintain an allocation registry (region → VPC CIDR; cluster → pod/service CIDRs) so new clusters/regions never collide
Account-quota runbook — DO soft-quotas (Droplets 25, Managed DB clusters 10, K8s clusters gated by Droplets) gate scale before any VPC/IP limit does; raise them via support ahead of growth. Document the request cadence + thresholds
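
The IPAM / subnet allocation registry item above amounts to a collision check over one address plan. The Go sketch below shows that check with the standard net/netip package. The Registry type, the owner-key naming, and the example CIDRs are hypothetical. It treats VPC, pod, and service ranges as one flat set of non-overlapping allocations inside 10.0.0.0/8, which is an assumption about how the plan would be modelled.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Registry records which owner (region VPC or cluster subnet) holds which CIDR.
// Assumption: the real registry would live in Postgres, not in memory.
type Registry struct {
	plan   netip.Prefix
	allocs map[string]netip.Prefix
}

// NewRegistry creates a registry whose allocations must fall inside plan.
func NewRegistry(plan string) (*Registry, error) {
	p, err := netip.ParsePrefix(plan)
	if err != nil {
		return nil, fmt.Errorf("parse plan %q: %w", plan, err)
	}
	return &Registry{plan: p.Masked(), allocs: make(map[string]netip.Prefix)}, nil
}

// Allocate reserves cidr for owner, rejecting ranges that fall outside the plan
// or overlap any existing allocation.
func (r *Registry) Allocate(owner, cidr string) error {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return fmt.Errorf("parse %q: %w", cidr, err)
	}
	p = p.Masked() // normalise to the network address
	if _, exists := r.allocs[owner]; exists {
		return fmt.Errorf("%s already has an allocation", owner)
	}
	if p.Bits() < r.plan.Bits() || !r.plan.Contains(p.Addr()) {
		return fmt.Errorf("%s is outside the plan %s", p, r.plan)
	}
	for other, q := range r.allocs {
		if p.Overlaps(q) {
			return fmt.Errorf("%s overlaps %s held by %s", p, q, other)
		}
	}
	r.allocs[owner] = p
	return nil
}

func main() {
	reg, err := NewRegistry("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	// Owner keys and ranges below are illustrative placeholders.
	requests := [][2]string{
		{"region-1/vpc", "10.10.0.0/16"},
		{"cluster-a/pods", "10.20.0.0/16"},
		{"cluster-a/services", "10.21.0.0/20"},
		{"cluster-b/pods", "10.20.128.0/20"}, // collides with cluster-a/pods
	}
	for _, req := range requests {
		if err := reg.Allocate(req[0], req[1]); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("allocated:", req[0], req[1])
	}
}
```

Because the ranges cannot be resized after creation, rejecting a colliding request at allocation time is cheaper than discovering the overlap once a cluster exists.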