Metrics pipeline
Part of the self-contained SRE guide
This guide is self-contained — you do not need the PRD open. The contracts it depends on
(the HTTPRoute route-name format, the starform.io/* label set, the tenant key) are restated
inline below with provenance tags. Owned by PRD §35.2 / §20.2 / §24.1 — the PRD wins on conflict.
In plain words
Seven customer metrics come from two sources, joined on one identity. The gateway (Envoy) is
the only thing that sees request timing, so latency, RPS, throughput, and error rate come from
it. CPU, memory, and network come from the machine's own measurements (cAdvisor). They only line
up on a dashboard because both end up tagged with the same project · environment · service key.
| Metric | Source | Signal |
|---|---|---|
| Latency (p50/p95/p99) | Envoy | upstream_rq_time histogram |
| RPS | Envoy | upstream_rq_total rate |
| Error rate | Envoy | upstream_rq_xx (5xx ÷ total) |
| Throughput | Envoy | upstream_cx_(rx\|tx)_bytes_total |
| CPU | cAdvisor | container_cpu_* |
| Memory | cAdvisor | container_memory_* |
| Network | cAdvisor | container_network_* |
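As a worked example of the Envoy rows, RPS and error rate both come from differencing counters over a time window. The sketch below assumes two hypothetical scrapes taken 60 seconds apart. The sample values and function names are illustrative and not taken from the PRD.

```python
def per_second_rate(first: float, last: float, window_s: float) -> float:
    """Average per-second increase of a counter over the window (no reset handling)."""
    return (last - first) / window_s


def error_rate(total_rps: float, errors_rps: float) -> float:
    """Share of requests answered with 5xx; 0.0 when there is no traffic."""
    return errors_rps / total_rps if total_rps > 0 else 0.0


# Hypothetical counter values from two scrapes taken 60 seconds apart.
total_rps = per_second_rate(first=12_000, last=18_000, window_s=60)  # all requests
errors_rps = per_second_rate(first=300, last=420, window_s=60)       # 5xx only
print(total_rps, errors_rps, error_rate(total_rps, errors_rps))
```

A real query engine also handles counter resets and scrape jitter, which this sketch deliberately ignores.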
Diagram 3 — Attribution. The two customer-facing paths reach the same key by different means: Envoy by parsing the route name, cAdvisor by joining against kube-state-metrics. Direct label promotion applies only to the optional app-metrics path. vmagent then writes the customer series to the regional VictoriaMetrics — a single destination. Platform self-monitoring is a separate pipeline: a dedicated Grafana Alloy agent ships Starform-component series to Grafana Cloud (ch.5).
How to build it
One vmagent per cluster scrapes three sources and writes to one store. In order:
1 · Stand up the store (per region). Run single-node VictoriaMetrics on the regional VM
droplet (90d retention, bound to localhost), and put vmauth in front of it as the only
VPC-facing door — one per-cluster bearer token routing ingest + reads, nothing else exposed:
victoria-metrics-prod -retentionPeriod=90d -storageDataPath=/var/lib/vm \
-httpListenAddr=127.0.0.1:8428 # localhost only — vmauth is the sole VPC-facing door
# vmauth config (the file passed to -auth.config): per-cluster token; ingest + read; nothing else public
users:
  - bearer_token: "<cluster-token>"
    url_map:
      - { src_paths: ["/insert/.*"], url_prefix: "http://127.0.0.1:8428/" }            # vmagent ingest
      - { src_paths: ["/select/.*","/api/.*"], url_prefix: "http://127.0.0.1:8428/" }  # Starbase reads
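To illustrate what a read through that door looks like, here is a minimal sketch that builds (but does not send) an instant query against vmauth with the cluster bearer token. The host name is a hypothetical placeholder, and port 8427 is vmauth's default listen port, assumed here because the guide does not state one.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(base_url: str, token: str, promql: str) -> Request:
    """Build an instant-query request routed through vmauth; nothing is sent."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical VPC address and token placeholder; substitute real values.
req = build_query_request("http://vmauth.region.internal:8427", "<cluster-token>", "up")
print(req.full_url)
```

The path falls under the /api/.* entry of the url_map, so vmauth forwards it to the localhost-bound store only when the token matches.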
2 · Trim Envoy at the source. The stat-inclusion matcher on the EnvoyProxy CRD (not vmagent) is a strict allowlist, so it names two sets: the per-route customer families — cutting ~100–150 series/route to ~7 — and Envoy's bounded gateway-self health metrics, which platform monitoring needs (Alloy → Grafana Cloud, ch.5). Custom web-latency buckets replace Envoy's minute/hour defaults:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
spec:
  bootstrap:
    type: Merge   # patch Envoy's StatsConfig
    value: |
      stats_config:
        stats_matcher:
          inclusion_list:
            patterns:
              # customer L7, per route → vmagent → VictoriaMetrics
              - safe_regex: { regex: ".*upstream_rq_time.*" }
              - safe_regex: { regex: ".*upstream_rq_xx.*" }
              - safe_regex: { regex: ".*upstream_cx_(rx|tx)_bytes_total.*" }
              # gateway-self health, bounded (per-instance) → Alloy → Grafana Cloud (ch.5)
              - safe_regex: { regex: "server\\..*" }
              - safe_regex: { regex: "control_plane\\..*" }
              - safe_regex: { regex: "listener_manager\\..*" }
              - safe_regex: { regex: "cluster_manager\\..*" }
        histogram_bucket_settings:   # replace Envoy's minute/hour defaults
          - match: { safe_regex: { regex: ".*upstream_rq_time.*" } }
            buckets: [5, 10, 25, 50, 100, 250, 1000, 5000, 10000]
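The bucket bounds matter because percentiles are estimated from cumulative bucket counts, not from raw timings. The sketch below uses linear interpolation inside the bucket that contains the target rank, the same general idea behind histogram_quantile. The cumulative counts are made-up sample data, and the function name is hypothetical.

```python
import math

# Upper bounds in milliseconds from the config above, plus the implicit +Inf bucket.
BOUNDS_MS = [5, 10, 25, 50, 100, 250, 1000, 5000, 10000, math.inf]


def quantile_from_buckets(q: float, cumulative: list) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, count in enumerate(cumulative) if count >= rank)
    if math.isinf(BOUNDS_MS[i]):
        return BOUNDS_MS[i - 1]  # cannot interpolate into +Inf
    lower = BOUNDS_MS[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (BOUNDS_MS[i] - lower) * (rank - below) / (cumulative[i] - below)


# Hypothetical cumulative counts for 100 requests, one entry per bound.
counts = [10, 30, 60, 80, 90, 96, 98, 100, 100, 100]
for q in (0.50, 0.95, 0.99):
    print(q, quantile_from_buckets(q, counts))
```

With this sample the p99 rank falls in the wide 1000 to 5000 ms bucket, so the estimate is coarse there, while the closely spaced low-end bounds keep p50 and p95 tight. That trade-off is the reason to tune bounds for web latencies.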
3 · Expose the pod labels on kube-state-metrics. KSM is the only thing that surfaces a pod's
labels as a metric (kube_pod_labels), and it hides custom labels unless you allowlist them — so
set this on the KSM Helm chart (installed at bootstrap, ch.8). Without it, the cAdvisor join in
step 5 silently matches nothing:
# values for the kube-state-metrics chart (ch.8) → becomes the --metric-labels-allowlist arg
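metricLabelsAllowlist:
  # managed-by is listed because the step-5 rule filters on label_starform_io_managed_by
  # (assumes the pod label key is starform.io/managed-by)
  - pods=[starform.io/project-id,starform.io/environment,starform.io/service-id,starform.io/managed-by]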
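For orientation, the pod label keys do not appear verbatim on kube_pod_labels: kube-state-metrics prefixes them with label_ and replaces characters that are invalid in a Prometheus label name with underscores. The helper below is a rough approximation of that mapping (it ignores KSM's handling of name collisions) and its name is hypothetical.

```python
import re


def ksm_label_name(k8s_label: str) -> str:
    """Approximate how kube-state-metrics renders a pod label key as a metric label."""
    return "label_" + re.sub(r"[^a-zA-Z0-9_]", "_", k8s_label)


for key in ("starform.io/project-id", "starform.io/environment", "starform.io/service-id"):
    print(key, "->", ksm_label_name(key))
```

These rendered names are what the join in step 5 has to reference.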
4 · Deploy vmagent — one agent, three scrape jobs, each scraping one source and keeping only what it needs.
Envoy (gateway pods) attaches identity itself: metric_relabel_configs parse
project / service / environment out of envoy_cluster_name (the route name) with a
fixed-width positional regex. A drop-list skips is_ephemeral (preview) routes — they don't get
per-route L7 metrics in the MVP:
# envoy_cluster_name = httproute/<ns>/<project32><service32>-<env>/rule/0
metric_relabel_configs:
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$1'
    target_label: project_id
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$2'
    target_label: service_id
  - source_labels: [envoy_cluster_name]
    regex: 'httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*'
    replacement: '$3'
    target_label: environment
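Because the parse is positional, it is worth pinning in a small test. The sketch below applies the same pattern with a full match, since relabel regexes are anchored at both ends. The sample cluster name is fabricated to fit the documented format; the namespace and IDs are placeholders.

```python
import re
from typing import Optional

# Same pattern as the relabel rules above.
ROUTE_RE = re.compile(r"httproute/[^/]+/([0-9a-f]{32})([0-9a-f]{32})-(.+)/rule/.*")


def parse_cluster_name(name: str) -> Optional[dict]:
    """Split envoy_cluster_name into identity labels, or None if the format differs."""
    match = ROUTE_RE.fullmatch(name)
    if match is None:
        return None
    project_id, service_id, environment = match.groups()
    return {"project_id": project_id, "service_id": service_id, "environment": environment}


# Fabricated sample: 32 hex chars of project, 32 of service, then the environment.
sample = "httproute/tenant-ns/" + "a" * 32 + "b" * 32 + "-production/rule/0"
print(parse_cluster_name(sample))
print(parse_cluster_name("httproute/tenant-ns/not-a-starform-route/rule/0"))
```

A name that does not fit the format yields no labels at all, which is the failure mode to watch for after an Envoy Gateway upgrade.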
kubelet / cAdvisor (each node, HTTPS + bearer) — keep only the container CPU / memory / network
series. No identity is attached here: cAdvisor knows only (namespace, pod), so step 5 joins it
to the labels.
- job_name: cadvisor
  scheme: https
  tls_config: { insecure_skip_verify: true }   # or ca_file: the kubelet CA
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs: [{ role: node }]
  relabel_configs:
    - { target_label: __metrics_path__, replacement: /metrics/cadvisor }
  metric_relabel_configs:
    - { source_labels: [__name__], regex: 'container_(cpu|memory|network)_.*', action: keep }
kube-state-metrics — keep kube_pod_labels, the series that carries the starform.io/* pod
labels (from step 3) and is the right-hand side of the step-5 join.
- job_name: kube-state-metrics
  kubernetes_sd_configs: [{ role: endpoints }]
  relabel_configs:
    - { source_labels: [__meta_kubernetes_service_name], regex: kube-state-metrics, action: keep }
  metric_relabel_configs:
    - { source_labels: [__name__], regex: kube_pod_labels, action: keep }
5 · Attribute cAdvisor by joining against KSM. cAdvisor's container_* series carry only
(namespace, pod), so a vmalert recording rule joins them to kube_pod_labels on
(namespace, pod), copies the starform.io/* labels across with group_left, renames them to
clean project_id / environment / service_id keys, and records a flat starform:service:*
series — so Stardeck reads a ready-made per-service metric instead of running the join on every
dashboard load:
# KSM exposes pod labels as label_starform_io_* ; join, rename, precompute per-service
groups:
  - name: starform-attribution
    rules:
      - record: starform:service:cpu
        expr: |
          sum by (project_id, environment, service_id) (
            label_replace(label_replace(label_replace(
              rate(container_cpu_usage_seconds_total{container!=""}[5m])
              * on (namespace, pod) group_left(
                  label_starform_io_project_id,
                  label_starform_io_environment,
                  label_starform_io_service_id)
              kube_pod_labels{label_starform_io_managed_by="shuttle"},
              "project_id", "$1", "label_starform_io_project_id", "(.*)"),
              "environment", "$1", "label_starform_io_environment", "(.*)"),
              "service_id", "$1", "label_starform_io_service_id", "(.*)")
          )
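The semantics of that rule can be sketched in plain Python: index the label rows by (namespace, pod), attach them to each container sample, and sum per identity key. This is an illustration of the many-to-one join only, not how vmalert evaluates it; the data shapes and sample values are hypothetical.

```python
from collections import defaultdict


def attribute(cpu_samples, pod_labels):
    """Join per-container CPU to pod labels on (namespace, pod), then sum per service.

    cpu_samples: iterable of (namespace, pod, container, cores)
    pod_labels:  iterable of (namespace, pod, {"project_id", "environment", "service_id"})
    """
    index = {}
    for namespace, pod, labels in pod_labels:
        if (namespace, pod) in index:
            # The right-hand side must be unique per (namespace, pod).
            raise ValueError(f"duplicate label row for {namespace}/{pod}")
        index[(namespace, pod)] = labels

    totals = defaultdict(float)
    for namespace, pod, _container, cores in cpu_samples:
        labels = index.get((namespace, pod))
        if labels is None:
            continue  # unmatched samples drop out of the result
        key = (labels["project_id"], labels["environment"], labels["service_id"])
        totals[key] += cores
    return dict(totals)


identity = {"project_id": "p1", "environment": "prod", "service_id": "s1"}
cpu = [("ns", "web-a", "app", 0.25), ("ns", "web-a", "sidecar", 0.125),
       ("ns", "web-b", "app", 0.5), ("ns", "unlabelled", "app", 1.0)]
rows = [("ns", "web-a", identity), ("ns", "web-b", identity)]
print(attribute(cpu, rows))
```

Two behaviours carry over to the real rule: pods without a matching label row silently disappear from the output, and a second label row for the same pod is an error, not a merge.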
6 · Ship to the store. vmagent's single -remoteWrite.url → the regional vmauth/
VictoriaMetrics with the cluster token. Platform / Starform-component series are not vmagent's
job — a separate Alloy agent handles those (ch.4 · ch.5).
7 · Verify. A web service shows latency p50/p95/p99, RPS, error-rate, throughput (Envoy) and
CPU/mem/network (cAdvisor), all tagged project·env·service; preview routes are suppressed.
FR-063/064/067 · SC-018
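One way to automate part of this check is to scan the label sets returned by the store and flag any series missing an identity key. The sketch below works on plain dictionaries, so it makes no assumption about how the series were fetched. The sample series are invented for illustration.

```python
REQUIRED = ("project_id", "environment", "service_id")


def missing_identity(series):
    """Return the metric names of series that lack any of the three identity labels."""
    return [s.get("__name__", "?") for s in series
            if any(not s.get(key) for key in REQUIRED)]


# Invented label sets: the second one has lost its service_id.
sample_series = [
    {"__name__": "starform:service:cpu", "project_id": "p1",
     "environment": "prod", "service_id": "s1"},
    {"__name__": "envoy_cluster_upstream_rq_xx", "project_id": "p1",
     "environment": "prod"},
]
print(missing_identity(sample_series))
```

An empty result means every customer series carries the full project, environment, and service key.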
Gotchas & what lives elsewhere
- The string parse is coupled to Envoy Gateway's cluster-name format. It won't drift between deploys, but it can break on an EG version bump. Pin the EG version, snapshot the exact /stats/prometheus output in a test, and re-verify on every upgrade. Envoy's native stats_tags is not more robust: it is the same parse, just inside Envoy.
- The join fails if the right side isn't unique per (namespace, pod). A duplicate label set fails the rule evaluation outright (the documented kube-prometheus "multiple matches" error). Guarantee one kube_pod_labels row per pod.
- kube-state-metrics must be told to expose your labels: set --metric-labels-allowlist=pods=[starform.io/...]. Forget it and the join silently matches nothing.
- Envoy series trimming lives on the EnvoyProxy CRD (stat-inclusion matcher + custom histogram buckets), not vmagent. The matcher is a strict allowlist: it cuts per-route stats from ~100–150 to ~7, but anything unlisted is never emitted, so it must also keep Envoy's bounded gateway-self families (server.*, control_plane.*, listener_manager.*, cluster_manager.*), or platform monitoring of the gateway goes blind. vmagent's keep/drop is belt-and-suspenders on the customer set.
PRD reference & inlined contracts
Owned by §35.2 (sources, attribution, cardinality), §20.2 (HTTPRoute name format), §24.1 (label set & tenant key); FR-063 / FR-064 / FR-067. The route-name parse and the label set are restated above so this guide stands alone — if they ever diverge, the PRD pages win. Canonical map: Canonical Sources.