# ADR-007: Use kube-prometheus-stack for Observability

## Status

Accepted

## Date

2026-03-07

## Context
The sandbox needs an observability layer to monitor cluster health, resource usage, and application metrics. The solution must be CNCF-aligned, production-representative, and runnable on a local MacBook with constrained resources.
## Decision
Use kube-prometheus-stack (Prometheus Community Helm chart), which bundles:
| Component | CNCF Status | Purpose |
|---|---|---|
| Prometheus | CNCF Graduated | Metrics collection and storage |
| Grafana | Not a CNCF project (Grafana Labs is a CNCF member) | Dashboards and visualisation |
| Alertmanager | Part of Prometheus | Alert routing (disabled Phase 1) |
| kube-state-metrics | Kubernetes project (SIG Instrumentation) | K8s object metrics |
| node-exporter | Prometheus ecosystem | Node hardware metrics |
| prometheus-operator | Bundled | Manages Prometheus via CRDs |
Resource tuning for a local MacBook:

- Prometheus retention: 24 h / 2 GB
- Grafana: 128 MB request, 256 MB limit
- Alertmanager: disabled (Phase 1)
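The tuning above maps onto chart values roughly as follows. This is a sketch: the key paths follow the kube-prometheus-stack chart's `values.yaml` conventions and should be checked against the installed chart version.

```yaml
# values.yaml (excerpt) — local-MacBook tuning for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 24h        # time-based retention
    retentionSize: 2GB    # size-based retention cap

grafana:
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 256Mi

alertmanager:
  enabled: false          # Phase 1: no alerting
```

Applied with something like `helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values.yaml`.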
## Alternatives Considered
| Tool | Reason Not Chosen |
|---|---|
| VictoriaMetrics | Excellent performance and storage efficiency, but less ubiquitous in production; Prometheus is the CNCF standard to learn first |
| Thanos / Cortex | Long-term Prometheus storage solutions; over-engineered for a single-node local cluster |
| OpenTelemetry Collector + Jaeger | Covers tracing (Phase 3 candidate); does not replace metrics |
| Datadog / New Relic | Commercial; not open source; not representative of self-hosted CNCF stack |
| Metrics Server only | Provides only CPU/memory for HPA; no dashboards, no persistence, no alerting |
## Consequences

### Positive
- One Helm chart installs the complete monitoring stack including CRDs (ServiceMonitor, PodMonitor)
- Pre-built dashboards for K8s cluster, nodes, and pods out of the box
- Learning Prometheus query language (PromQL) is directly transferable to production
- ServiceMonitor CRDs allow apps to self-register metrics scraping targets (Phase 2+)
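To illustrate the self-registration flow, a minimal ServiceMonitor might look like the following. The app name, namespace, and port name are hypothetical; the `release` label is what the chart's default ServiceMonitor selector typically matches on.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app                   # hypothetical app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Helm release name for discovery
spec:
  selector:
    matchLabels:
      app: sample-app                # matches the labels on the app's Service
  endpoints:
    - port: metrics                  # named port on the Service exposing /metrics
      interval: 30s
```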
### Negative
- kube-prometheus-stack is a large chart (50+ sub-components); significant RAM usage even when tuned down
- Grafana admin password sits in plain text in the values file; it must not be committed with real credentials to shared or production repos
  - Phase 3 will replace it with SOPS-encrypted secrets
- Alertmanager disabled — Phase 1 has no alerting
### Grafana Access

```bash
# NodePort access (no tunnel needed)
minikube service kube-prometheus-stack-grafana -n monitoring

# Or add to /etc/hosts and use ingress:
echo "127.0.0.1 grafana.local" | sudo tee -a /etc/hosts
# Open http://grafana.local (requires `sudo minikube tunnel`)
# Login: admin / admin
```
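If neither NodePort nor the tunnel is convenient, a plain port-forward also works. This assumes the chart's default Grafana service name and service port 80; adjust if the release name differs.

```shell
# Forward local port 3000 to the Grafana service inside the cluster
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Then browse to http://localhost:3000 (login: admin / admin)
```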
## Trade-offs
Learning completeness (full observability stack) is prioritised over minimal resource usage. The memory footprint (~1.5 GB for the full stack) is acceptable on a MacBook with 16+ GB RAM.