Metrics pipeline

Part of the self-contained SRE guide

This guide is self-contained — you do not need the PRD open. The contracts it depends on (the HTTPRoute route-name format, the starform.io/* label set, the tenant key) are restated inline below with provenance tags. Owned by PRD §35.2 / §20.2 / §24.1 — the PRD wins on conflict.

In plain words

Seven customer metrics come from two sources, joined on one identity. The gateway (Envoy) is the only thing that sees request timing, so latency, RPS, throughput, and error rate come from it. CPU, memory, and network come from the machine's own measurements (cAdvisor). They only line up on a dashboard because both end up tagged with the same project · environment · service key.

| Metric | Source | Signal |
|---|---|---|
| Latency (p50/p95/p99) | Envoy | upstream_rq_time histogram |
| RPS | Envoy | upstream_rq_total rate |
| Error rate | Envoy | upstream_rq_xx (5xx ÷ total) |
| Throughput | Envoy | upstream_cx_(rx\|tx)_bytes_total |
| CPU | cAdvisor | container_cpu_* |
| Memory | cAdvisor | container_memory_* |
| Network | cAdvisor | container_network_* |
[Diagram: three sources → three ways to attach identity → one key.
  • Envoy GW (latency · RPS · errors): PARSE. Read the envoy_cluster_name string apart → project / service / env.
  • kubelet / cAdvisor (CPU · mem · net): JOIN. (namespace, pod) ⋈ kube_pod_labels via a recording rule; KSM is load-bearing.
  • app /metrics (optional, opt-in): PROMOTE. The pod already carries the labels.
All three feed vmagent (scrape + relabel + recording rules), which emits one label key, project·env·service, and remote_writes the customer workload series to VictoriaMetrics. Platform series go to a separate Alloy agent (ch.5).]

Diagram 3 — Attribution. The two customer-facing paths reach the same key by different means: Envoy by parsing the route name, cAdvisor by joining against kube-state-metrics. Direct label promotion applies only to the optional app-metrics path. vmagent then writes the customer series to the regional VictoriaMetrics — a single destination. Platform self-monitoring is a separate pipeline: a dedicated Grafana Alloy agent ships Starform-component series to Grafana Cloud (ch.5).

How to build it

One vmagent per cluster scrapes three sources and writes to one store. In order:

1 · Stand up the store (per region). Run single-node VictoriaMetrics on the regional VM droplet (90d retention, bound to localhost), and put vmauth in front of it as the only VPC-facing door — one per-cluster bearer token routing ingest + reads, nothing else exposed:

VictoriaMetrics launch · the metrics-store VM
victoria-metrics-prod -retentionPeriod=90d -storageDataPath=/var/lib/vm \
  -httpListenAddr=127.0.0.1:8428          # localhost only — vmauth is the sole VPC-facing door
vmauth -auth.config · the only VPC-facing door
# per-cluster token; ingest + read; nothing else public
users:
  - bearer_token: "<cluster-token>"
    url_map:
      - { src_paths: ["/insert/.*"],           url_prefix: "http://127.0.0.1:8428/" }   # vmagent ingest
      - { src_paths: ["/select/.*","/api/.*"], url_prefix: "http://127.0.0.1:8428/" }   # Starbase reads

2 · Trim Envoy at the source. The stat-inclusion matcher on the EnvoyProxy CRD (not vmagent) is a strict allowlist, so it names two sets: the per-route customer families — cutting ~100–150 series/route to ~7 — and Envoy's bounded gateway-self health metrics, which platform monitoring needs (Alloy → Grafana Cloud, ch.5). Custom web-latency buckets replace Envoy's minute/hour defaults:

EnvoyProxy CRD · kubectl apply
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
spec:
  bootstrap:
    type: Merge                       # patch Envoy's StatsConfig
    value: |
      stats_config:
        stats_matcher:
          inclusion_list:
            patterns:
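              # customer L7, per route → vmagent → VictoriaMetrics
              # The matcher sees Envoy's native stat names (cluster.<name>.upstream_rq_5xx),
              # not the Prometheus-side upstream_rq_xx, so match the class digit explicitly.
              - safe_regex: { regex: ".*upstream_rq_time.*" }
              - safe_regex: { regex: ".*upstream_rq_total.*" }    # RPS, and the error-rate denominator
              - safe_regex: { regex: ".*upstream_rq_[1-5]xx.*" }
              - safe_regex: { regex: ".*upstream_cx_(rx|tx)_bytes_total.*" }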
              # gateway-self health, bounded (per-instance) → Alloy → Grafana Cloud (ch.5)
              - safe_regex: { regex: "server\\..*" }
              - safe_regex: { regex: "control_plane\\..*" }
              - safe_regex: { regex: "listener_manager\\..*" }
              - safe_regex: { regex: "cluster_manager\\..*" }
        histogram_bucket_settings:     # replace Envoy's minute/hour defaults
          - match: { safe_regex: { regex: ".*upstream_rq_time.*" } }
            buckets: [5, 10, 25, 50, 100, 250, 1000, 5000, 10000]

3 · Expose the pod labels on kube-state-metrics. KSM is the only thing that surfaces a pod's labels as a metric (kube_pod_labels), and it hides custom labels unless you allowlist them — so set this on the KSM Helm chart (installed at bootstrap, ch.8). Without it, the cAdvisor join in step 5 silently matches nothing:

kube-state-metrics · ksm.values.yaml
# values for the kube-state-metrics chart (ch.8) → becomes the --metric-labels-allowlist arg
metricLabelsAllowlist:
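  # starform.io/managed-by is listed because the step-5 rule filters on label_starform_io_managed_by;
  # the key spelling is inferred from that sanitized name, so confirm it against the PRD label set
  - pods=[starform.io/project-id,starform.io/environment,starform.io/service-id,starform.io/managed-by]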

4 · Deploy vmagent — one agent, three scrape jobs, each scraping one source and keeping only what it needs.

Envoy (gateway pods) attaches identity itself: metric_relabel_configs parse project / service / environment out of envoy_cluster_name (the route name) with a fixed-width positional regex. A drop-list skips is_ephemeral (preview) routes — they don't get per-route L7 metrics in the MVP:

vmagent scrape config · the Envoy job
# envoy_cluster_name = httproute/<ns>/<project32><service32>-<env>/rule/0
metric_relabel_configs:
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$1'
    target_label: project_id
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$2'
    target_label: service_id
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$3'
    target_label: environment
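
The fragment above shows only the relabel rules. A minimal sketch of the job they sit in might look like the following. The job name, the pod label selector, and the port name are assumptions about how the gateway pods are deployed, not something this guide specifies, so check them against what `kubectl get pods --show-labels` reports on the gateway namespace. The final keep rule is one possible form of the belt-and-suspenders filter on the customer set; it uses the Prometheus-side metric names that Envoy emits on /stats/prometheus.

```yaml
# Hypothetical wrapper for the Envoy relabel rules; anything marked "assumed" is illustrative.
- job_name: envoy-gateway                   # assumed job name
  metrics_path: /stats/prometheus           # Envoy's Prometheus-format stats endpoint
  kubernetes_sd_configs: [{ role: pod }]
  relabel_configs:
    # keep only gateway proxy pods (assumed label key and value)
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
      regex: envoy
      action: keep
    # keep only the container port that serves stats (assumed port name)
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      regex: metrics
      action: keep
  metric_relabel_configs:
    # second line of defence: keep only the customer families, whatever Envoy emits
    - source_labels: [__name__]
      regex: 'envoy_cluster_upstream_(rq_time_.*|rq_total|rq_xx|cx_(rx|tx)_bytes_total)'
      action: keep
    # ...followed by the three project_id / service_id / environment rules shown above
```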

kubelet / cAdvisor (each node, HTTPS + bearer) — keep only the container CPU / memory / network series. No identity is attached here: cAdvisor knows only (namespace, pod), so step 5 joins it to the labels.

vmagent scrape config · the cAdvisor job
- job_name: cadvisor
  scheme: https
  tls_config: { insecure_skip_verify: true }       # or ca_file: the kubelet CA
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs: [{ role: node }]
  relabel_configs:
    - { target_label: __metrics_path__, replacement: /metrics/cadvisor }
  metric_relabel_configs:
    - { source_labels: [__name__], regex: 'container_(cpu|memory|network)_.*', action: keep }

kube-state-metrics — keep kube_pod_labels, the series that carries the starform.io/* pod labels (from step 3) and is the right-hand side of the step-5 join.

vmagent scrape config · the KSM job
- job_name: kube-state-metrics
  kubernetes_sd_configs: [{ role: endpoints }]
  relabel_configs:
    - { source_labels: [__meta_kubernetes_service_name], regex: kube-state-metrics, action: keep }
  metric_relabel_configs:
    - { source_labels: [__name__], regex: kube_pod_labels, action: keep }

5 · Attribute cAdvisor by joining against KSM. cAdvisor's container_* series carry only (namespace, pod), so a vmalert recording rule joins them to kube_pod_labels on (namespace, pod), copies the starform.io/* labels across with group_left, renames them to clean project_id / environment / service_id keys, and records a flat starform:service:* series — so Stardeck reads a ready-made per-service metric instead of running the join on every dashboard load:

vmalert recording rule · rules.yml
# KSM exposes pod labels as label_starform_io_* ; join, rename, precompute per-service
groups:
  - name: starform-attribution
    rules:
      - record: starform:service:cpu
        expr: |
          sum by (project_id, environment, service_id) (
            label_replace(label_replace(label_replace(
              rate(container_cpu_usage_seconds_total{container!=""}[5m])
              * on (namespace, pod) group_left(
                  label_starform_io_project_id,
                  label_starform_io_environment,
                  label_starform_io_service_id)
                kube_pod_labels{label_starform_io_managed_by="shuttle"},
              "project_id", "$1", "label_starform_io_project_id", "(.*)"),
              "environment","$1", "label_starform_io_environment","(.*)"),
              "service_id", "$1", "label_starform_io_service_id", "(.*)")
          )
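
The same join pattern extends to the other two cAdvisor metrics. The sketch below is an assumption-laden illustration rather than the guide's rule file: the group name, the record names, and the choice of container_memory_working_set_bytes and container_network_receive_bytes_total as the underlying series are all hypothetical. The network rule carries no container filter because cAdvisor reports network counters for the pod's network namespace rather than per application container.

```yaml
# Sketch: the starform:service:cpu join applied to memory and network.
# Group name, record names, and chosen cAdvisor series are assumptions.
groups:
  - name: starform-attribution-extra
    rules:
      - record: starform:service:memory          # working-set bytes per service
        expr: |
          sum by (project_id, environment, service_id) (
            label_replace(label_replace(label_replace(
              container_memory_working_set_bytes{container!=""}
              * on (namespace, pod) group_left(
                  label_starform_io_project_id,
                  label_starform_io_environment,
                  label_starform_io_service_id)
                kube_pod_labels{label_starform_io_managed_by="shuttle"},
              "project_id", "$1", "label_starform_io_project_id", "(.*)"),
              "environment", "$1", "label_starform_io_environment", "(.*)"),
              "service_id", "$1", "label_starform_io_service_id", "(.*)")
          )
      - record: starform:service:network_rx      # received bytes per second per service
        expr: |
          sum by (project_id, environment, service_id) (
            label_replace(label_replace(label_replace(
              rate(container_network_receive_bytes_total[5m])
              * on (namespace, pod) group_left(
                  label_starform_io_project_id,
                  label_starform_io_environment,
                  label_starform_io_service_id)
                kube_pod_labels{label_starform_io_managed_by="shuttle"},
              "project_id", "$1", "label_starform_io_project_id", "(.*)"),
              "environment", "$1", "label_starform_io_environment", "(.*)"),
              "service_id", "$1", "label_starform_io_service_id", "(.*)")
          )
```

A transmit-side rule would follow the same shape with container_network_transmit_bytes_total.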

6 · Ship to the store. Point vmagent's single -remoteWrite.url at the regional vmauth, which fronts VictoriaMetrics, and authenticate with the cluster token. Platform / Starform-component series are not vmagent's job; a separate Alloy agent handles those (ch.4 · ch.5).

7 · Verify. A web service shows latency p50/p95/p99, RPS, error-rate, throughput (Envoy) and CPU/mem/network (cAdvisor), all tagged project·env·service; preview routes are suppressed. FR-063/064/067 · SC-018
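
One way to make the Envoy half of this check concrete is to precompute the per-service series and query them directly. The rules below are a sketch under stated assumptions: the group and record names are hypothetical, the metric names are the ones Envoy exposes on /stats/prometheus, and the step-4 relabel rules are assumed to have already attached project_id / environment / service_id. The project_id!="" selector leaves out clusters whose name did not match the route-name regex.

```yaml
# Sketch: Envoy-side per-service series for the step-7 check. Names are assumptions.
groups:
  - name: starform-envoy
    rules:
      - record: starform:service:rps
        expr: |
          sum by (project_id, environment, service_id) (
            rate(envoy_cluster_upstream_rq_total{project_id!=""}[5m])
          )
      - record: starform:service:error_rate        # 5xx ÷ total, as a 0..1 ratio
        expr: |
          sum by (project_id, environment, service_id) (
            rate(envoy_cluster_upstream_rq_xx{project_id!="", envoy_response_code_class="5"}[5m])
          )
          /
          sum by (project_id, environment, service_id) (
            rate(envoy_cluster_upstream_rq_total{project_id!=""}[5m])
          )
      - record: starform:service:latency_p95_ms    # Envoy reports upstream_rq_time in milliseconds
        expr: |
          histogram_quantile(0.95,
            sum by (le, project_id, environment, service_id) (
              rate(envoy_cluster_upstream_rq_time_bucket{project_id!=""}[5m])
            )
          )
```

A service that has never returned a 5xx has no numerator series, so error_rate is absent rather than zero for it; a dashboard would need to treat a missing value as 0. The p50 and p99 rules differ from the p95 one only in the quantile argument.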

Gotchas & what lives elsewhere

  • The string parse is coupled to Envoy Gateway's cluster-name format. It won't drift between deploys, but it can break on an EG version bump. Pin the EG version, snapshot the exact /stats/prometheus output in a test, re-verify on every upgrade. Envoy's native stats_tags is not more robust — same parse, just inside Envoy.
  • The join fails if the right side isn't unique per (namespace, pod). A duplicate label set fails the rule evaluation outright (the documented kube-prometheus "multiple matches" error). Guarantee one kube_pod_labels row per pod.
  • kube-state-metrics must be told to expose your labels — set --metric-labels-allowlist=pods=[starform.io/...]. Forget it and the join silently matches nothing.
  • Envoy series trimming lives on the EnvoyProxy CRD (stat-inclusion matcher + custom histogram buckets), not vmagent. The matcher is a strict allowlist: it cuts per-route stats from ~100–150 to ~7, but anything unlisted is never emitted — so it must also keep Envoy's bounded gateway-self families (server.*, control_plane.*, listener_manager.*, cluster_manager.*), or platform monitoring of the gateway goes blind. vmagent's keep/drop is belt-and-suspenders on the customer set.
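
The two join gotchas fail quietly, so they are good candidates for alerts. The vmalert rules below are a sketch, not part of the PRD: the group name, alert names, and `for` durations are assumptions. The second rule fires whenever cAdvisor CPU series exist but the recorded series does not, so it would also fire on a cluster that has no customer workloads yet.

```yaml
# Sketch: guard rails for the join gotchas. Names and durations are assumptions.
groups:
  - name: starform-attribution-guards
    rules:
      - alert: StarformPodLabelsDuplicated
        # more than one kube_pod_labels row for a pod breaks the group_left join
        expr: |
          count by (namespace, pod) (
            kube_pod_labels{label_starform_io_managed_by="shuttle"}
          ) > 1
        for: 5m
      - alert: StarformAttributionEmpty
        # cAdvisor series are arriving, but the join produced nothing
        # (for example, the KSM label allowlist is missing)
        expr: |
          count(container_cpu_usage_seconds_total{container!=""}) > 0
          unless on() count(starform:service:cpu)
        for: 15m
```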

PRD reference & inlined contracts

Owned by §35.2 (sources, attribution, cardinality), §20.2 (HTTPRoute name format), §24.1 (label set & tenant key); FR-063 / FR-064 / FR-067. The route-name parse and the label set are restated above so this guide stands alone — if they ever diverge, the PRD pages win. Canonical map: Canonical Sources.