Not Yet Designed

The following items require detailed design before or during implementation. The priority tiers are MVP-blocking, pre-launch-blocking, and post-MVP. (Source: §39.) When a design touches one of these, surface it.

§39.1 MVP-Blocking — Must Design Before Building Starbase

  1. Database schema — Postgres tables for users, workspaces, projects, environments, services, deployments, builds, clusters, billing, regions, var_groups, var_group_entries, service_var_groups, github_installations, webhook_deliveries (for dedup), buckets (object storage primitive). RBAC tables (users, workspaces, workspace_members, projects, project_members, environments) defined in §15.7. Identity model (v1.9): there is no customers table — the entity hierarchy is workspace → project → environment → service. The canonical tenant key carried through desired state, snapshots, metrics, and logs is project_id + environment + service_id, with workspace_id as the billing-boundary label; the stale customer_id field was removed throughout. Desired state serialization code (internal/service/desiredstate.go) is downstream of this work. Environments are customer-named, validated as an RFC 1123 label (≤30 chars, §20.2); the environments table carries an is_ephemeral flag so preview/ephemeral environments are identified structurally, not by name-matching — this is what the per-route preview-metric suppression keys off (FR-067).
  2. Starbase API endpoints — Full REST API for dashboard, detailed request/response contracts for all resources (services, deployments, var groups, custom domains, buckets, databases, etc.)
  3. Customer onboarding flow — Signup → create workspace → create project → connect Git → first deploy (end-to-end happy path)
  4. Deployments & rollbacks — Deployment versioning model, rollback UX (one-click to previous version), zero-downtime rolling deploy semantics, deployment history retention, surge limits, build state machine (specified in §16.12)
  5. Health checks — Customer-configurable startup/liveness/readiness probes (HTTP path, TCP, gRPC); how Shuttle/Gateway wait for readiness before shifting traffic
  6. Service types — Distinguish web services (HTTP-serving with HTTPRoute), background workers (no ingress), cron jobs (K8s CronJob resource, not Deployment). Each renders differently in Shuttle.
  7. Volumes / persistent storage — DO Block Storage CSI driver configuration, volume attach/resize/snapshot workflows, survives pod restart/node failure
  8. Wildcard / auto-generated domains — DNS zone for *.starform.app or equivalent, TLS cert issuance for wildcards, routing to the right pod, subdomain conflict handling
  9. Horizontal autoscaling (HPA) — Per-service scaling rules, min/max replica limits, scale-up/scale-down cooldown. MVP scope: CPU/memory only, driven by the Kubernetes Metrics API (metrics-server — confirmed not bundled on DOKS, installed at bootstrap per §26.3). Scaling on RPS/latency is deferred — custom-metric HPA requires a prometheus-adapter (or KEDA) exposing the regional VictoriaMetrics through the custom-metrics API, which has no path today. Note that FR-050's customer-facing CPU/memory display comes from the cAdvisor → vmagent → VictoriaMetrics path (§35.2) and is separate from HPA's metrics-server source — do not conflate the two. Separate from cluster autoscaling (already designed) and scale-to-zero (post-MVP).
  10. Internal / private networking — Service-to-service DNS pattern, public-vs-private service flag, cross-environment network isolation (§20.4), NetworkPolicy enforcement
  11. Live log streaming UX — WebSocket or SSE streaming from ClickHouse to dashboard, CLI logs -f support, filtering by level/service/time, search
  12. Build overrides — Custom build command, pre-deploy command (e.g., DB migrations), start command — how users override Railpack defaults
  13. Customer-facing metrics dashboard — Default dashboards per service (request rate, error rate, latency, CPU, memory), ability to build custom dashboards
  14. Stripe integration detail — Invoice lifecycle, payment failure handling, dunning flow, refunds, proration edge cases
  15. Build pipeline error handling — Build timeouts, Dockerfile/Railpack build failures, image size limits, retry semantics
  16. Managed object storage primitive (Tigris) — S3-compatible buckets provisioned per environment via Tigris Partner Integration API. One Tigris Organization per Starform workspace (isolated tenants). Credential injection into services via auto-generated system Secret (endpoint, access key ID, secret access key). Public vs. private visibility flag. Credential display in dashboard after provisioning. Zero egress fees pass through to customer. Pre-launch action: negotiate partner pricing and SLA with Tigris sales (help@tigrisdata.com) — avoid hallucinating economics before terms are confirmed.
  17. Per-environment branch configuration — Explicit mapping from Git branch to environment (e.g., main → production, develop → staging). Default auto-deploy on push to configured branch; opt-out flag per service-config.
  18. Encryption-at-rest catalog — Cross-cutting spec covering all encrypted fields in Postgres: Var Group entry values (§38.2), DB credentials, bucket credentials, GitHub installation tokens, and cloud-provider credentials/kubeconfigs Starform holds for its own clusters and future multi-cloud regions (§12). All use AES-256-GCM with a single key from ENCRYPTION_KEY env var; key rotation is a post-MVP batch job.
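The encryption-at-rest item above names AES-256-GCM with a single key read from ENCRYPTION_KEY. The following is a minimal sketch of what field-level sealing could look like under that description, not the platform's actual implementation. The function names (loadKey, encryptField, decryptField), the base64 encoding of the key, and the nonce-prefixed storage layout are assumptions made for illustration; the source specifies none of them.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
	"os"
)

// loadKey reads the key from ENCRYPTION_KEY.
// Assumption: the variable holds a base64-encoded 32-byte key.
func loadKey() ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(os.Getenv("ENCRYPTION_KEY"))
	if err != nil {
		return nil, fmt.Errorf("decode ENCRYPTION_KEY: %w", err)
	}
	if len(key) != 32 {
		return nil, errors.New("ENCRYPTION_KEY must decode to 32 bytes for AES-256")
	}
	return key, nil
}

// encryptField seals plaintext with AES-256-GCM and returns nonce||ciphertext.
// A fresh random nonce is generated for every call.
func encryptField(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst prepends it to the sealed output.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptField splits nonce||ciphertext and authenticates while decrypting.
func decryptField(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value shorter than nonce")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key, err := loadKey()
	if err != nil {
		// No usable key in the environment: fall back to a throwaway demo key.
		key = make([]byte, 32)
		if _, err := rand.Read(key); err != nil {
			panic(err)
		}
	}
	sealed, err := encryptField(key, []byte("bucket-secret-access-key"))
	if err != nil {
		panic(err)
	}
	plain, err := decryptField(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sealed %d bytes, recovered %q\n", len(sealed), plain)
}
```

Storing the nonce alongside the ciphertext keeps each encrypted column self-describing, which would also simplify the post-MVP re-encryption batch job: each value can be opened with the old key and resealed with the new one independently.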

§39.2 Pre-Launch-Blocking — Must Exist Before Public Launch

These gate public launch (SOC 2 / enterprise expectations), even though they're not needed to build the MVP.

  1. Audit logs — Who did what and when, across all customer-facing operations. Required for SOC 2, expected by enterprise
  2. API tokens / service accounts — Scoped tokens for CI/CD integration (deploy from GitHub Actions without user credentials), per-machine tokens
  3. Status page / SLA / incident response — Public status page (status.starform.io), incident response runbooks, SLA commitments for uptime
  4. GDPR / account deletion / data retention — Right-to-be-forgotten workflow, account closure data cleanup, data export before deletion. Includes confirming Grafana Cloud platform-telemetry data residency — either an EU-residency stack or enforced exclusion of customer identifiers from the platform series (§35.5).
  5. Rate limiting — Per-customer API quotas (target 100 req/min/user), deploy rate limits, build queue limits, and protection against one runaway customer degrading the platform for others. Redis-backed token bucket implementation.
  6. Admission policy for label enforcement — Kyverno or custom admission webhook rejecting pods missing required starform.io/* labels. Prevents silent NetworkPolicy bypass (§20.4 failure mode).

§39.3 Post-MVP / Seed-Stage

Deferred beyond MVP. Preview environments (#28; item 4 in the list below) are decision-pending: ship at launch (Railway parity) or defer to seed.

  1. Auth system — Session management, JWT issuance for dashboard (15-min access + rotatable refresh tokens), SSO callback handling, password reset, MFA
  2. Cluster bootstrap workflow — Step-by-step: DOKS creation → Envoy Gateway → metrics-server + kube-state-metrics → Fluent Bit + vmagent + Grafana Alloy → Shuttle → DNS → active. Plus the out-of-band per-region provisioning (Terraform/cloud-init) of the ClickHouse + VictoriaMetrics VM droplets and the Vector aggregator (§4.1, §26.3)
  3. Custom domains — DNS validation, Cloudflare API integration, HTTPRoute lifecycle, cert issuance via cert-manager
  4. Preview environments — Ephemeral environments per PR, full replica of production with Git-branch-based lifecycle, flagged is_ephemeral on the environments table (§39.1 #1). Open: the per-route Envoy L7 metric handling for previews (short retention vs pre-aggregation vs suppression — FR-067), to be architected in the Observability Architecture doc. Strategic decision: ship at launch (Railway parity) or defer to seed stage
  5. Scale-to-zero — Free tier only, HPAScaleToZero (K8s 1.36), custom activator for request buffering
  6. Auth primitive — JWKS endpoint hosting, SecurityPolicy lifecycle, claim-to-header mapping UI
  7. Email primitive — AWS SES integration, per-customer domain management, bounce handling
  8. Queue primitive — NATS as customer-facing managed service
  9. CLI tool — starform deploy, starform logs, starform status, device code auth flow
  10. Platform-side disaster recovery — Control plane Postgres backup/restore, registry compromise response, runbooks
  11. Multi-region failover — Region drain workflow, customer migration between regions, SLA commitments
  12. Platform upgrade procedures — Rolling out new Shuttle versions across cluster fleet, K8s version upgrades, Envoy Gateway upgrades
  13. Capacity planning & quota enforcement — Cluster-level quota (1000 pods soft limit), auto-provision next cluster at threshold, signup throttling
  14. External secret store integration — AWS Secrets Manager, HashiCorp Vault as customer-facing secret-store backends (lets a customer source Var Group values from their own vault). Post-MVP. (Not BYOC, which is out of scope, §37.)
  15. Per-environment value overrides in Var Groups — currently requires separate groups per environment
  16. Project-to-cluster migration workflow — Moving a project between clusters (e.g., region change) requires coordinated source drain + target provision; not in MVP scope
  17. Encryption key rotation — Batch job to re-encrypt all encrypted fields with new key without downtime
  18. Self-hosted platform monitoring — migrate platform self-monitoring off Grafana Cloud (MVP-only, §35.5) to a self-hosted stack once the SRE team can operate it; re-evaluate cost vs. the external-watcher benefit
  19. Cross-region telemetry read-API / cache (optional) — the base design has Starbase read regional stores directly over private peering (§35.4). If out-of-region read latency or volume grows, an optional thin regional read-API/cache (private, co-located with the stores) removes the cross-region RTT. Not MVP-blocking, and not required by the base design
  20. Multi-region Vector aggregator placement — at MVP (single region) the regional Vector aggregator sits in the infra footprint; for added regions, decide whether each region's aggregator runs on a VM droplet (alongside its stores) or in a small regional platform cluster
  21. Multi-region control plane — the control plane is central by decision (Railway's model; one Starbase + one Postgres) and a single-region SPOF at MVP (§4.1). Running workloads survive an outage via Shuttle's level-driven loop (§32), but new deploys + dashboards stop. Distributing it (Fly's model) needs a distributed/replicated DB — out of near-term scope. Design control-plane HA before relying on multi-region
  22. IPAM / subnet allocation registry — VPC-native DOKS needs non-overlapping pod/service subnets per cluster plus per-region /16 VPCs, all from a systematic 10.0.0.0/8 plan that can't be resized after creation (§4.4). Maintain an allocation registry (region → VPC CIDR; cluster → pod/service CIDRs) so new clusters/regions never collide
  23. Account-quota runbook — DO soft-quotas (Droplets 25, Managed DB clusters 10, K8s clusters gated by Droplets) gate scale before any VPC/IP limit does; raise them via support ahead of growth. Document the request cadence + thresholds
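The IPAM item above asks for an allocation registry that keeps VPC, pod, and service CIDRs from colliding inside a 10.0.0.0/8 plan. One possible shape for the core check is sketched below using Go's net/netip package. The Registry type, the Reserve method, and the owner-key strings (such as "region/nyc3") are hypothetical; the source does not define a schema or naming scheme, and a real registry would persist allocations in Postgres rather than in a map.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Registry records which CIDR each owner (a region VPC or a cluster subnet) holds.
type Registry struct {
	pool   netip.Prefix            // the overall address plan, e.g. 10.0.0.0/8
	allocs map[string]netip.Prefix // owner key -> reserved prefix
}

// NewRegistry creates an empty registry for the given pool CIDR.
func NewRegistry(pool string) (*Registry, error) {
	p, err := netip.ParsePrefix(pool)
	if err != nil {
		return nil, err
	}
	return &Registry{pool: p.Masked(), allocs: map[string]netip.Prefix{}}, nil
}

// Reserve records cidr for owner. It rejects prefixes that fall outside the
// pool, owners that already hold an allocation, and any overlap with an
// existing allocation.
func (r *Registry) Reserve(owner, cidr string) error {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return err
	}
	p = p.Masked()

	// The prefix must be at least as specific as the pool and start inside it.
	if p.Bits() < r.pool.Bits() || !r.pool.Contains(p.Addr()) {
		return fmt.Errorf("%s is outside pool %s", p, r.pool)
	}
	if _, taken := r.allocs[owner]; taken {
		return fmt.Errorf("owner %q already has an allocation", owner)
	}
	for other, q := range r.allocs {
		if p.Overlaps(q) {
			return fmt.Errorf("%s overlaps %s held by %q", p, q, other)
		}
	}
	r.allocs[owner] = p
	return nil
}

func main() {
	r, err := NewRegistry("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	// Owner keys and CIDRs below are made-up examples.
	fmt.Println(r.Reserve("region/nyc3", "10.0.0.0/16"))
	fmt.Println(r.Reserve("cluster/nyc3-1/pods", "10.1.0.0/16"))
	fmt.Println(r.Reserve("cluster/nyc3-1/services", "10.0.128.0/20")) // collides with the VPC
}
```

Because the subnets cannot be resized after creation, checking every reservation against the full registry before provisioning is the cheap place to catch a collision.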