# ADR-007: Use kube-prometheus-stack for Observability

## Status

Accepted

## Date

2026-03-07

## Context
The sandbox needs an observability layer to monitor cluster health, resource usage, and application metrics. The solution must be CNCF-aligned, production-representative, and runnable on a local MacBook with constrained resources.
## Decision
Use kube-prometheus-stack (Prometheus Community Helm chart), which bundles:
| Component | CNCF Status | Purpose |
|---|---|---|
| Prometheus | CNCF Graduated | Metrics collection and storage |
| Grafana | Not a CNCF project (Grafana Labs is a CNCF member) | Dashboards and visualisation |
| Alertmanager | Part of Prometheus | Alert routing (disabled Phase 1) |
| kube-state-metrics | Kubernetes project (SIG Instrumentation) | K8s object metrics |
| node-exporter | Prometheus ecosystem | Node hardware metrics |
| prometheus-operator | Bundled | Manages Prometheus via CRDs |
Resource tuning for a local MacBook:

- Prometheus retention: 24 h / 2 GB
- Grafana: 128 MB request, 256 MB limit
- Alertmanager: disabled (Phase 1)
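The tuning above maps onto chart values roughly as follows. This is a sketch: the key paths follow the kube-prometheus-stack chart's `values.yaml` conventions and should be checked against the installed chart version.

```yaml
# values.yaml (excerpt) — local-MacBook tuning for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 24h        # time-based retention
    retentionSize: 2GB    # size-based retention cap

grafana:
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 256Mi

alertmanager:
  enabled: false          # Phase 1: no alerting
```

Applied with something like `helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values.yaml`.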
## Alternatives Considered
| Tool | Reason Not Chosen |
|---|---|
| VictoriaMetrics | Excellent performance and storage efficiency, but less ubiquitous in production; Prometheus is the CNCF standard to learn first |
| Thanos / Cortex | Long-term Prometheus storage solutions; over-engineered for a single-node local cluster |
| OpenTelemetry Collector + Jaeger | Covers tracing (Phase 3 candidate); does not replace metrics |
| Datadog / New Relic | Commercial; not open source; not representative of self-hosted CNCF stack |
| Metrics Server only | Provides only CPU/memory for HPA; no dashboards, no persistence, no alerting |
## Consequences

### Positive
- One Helm chart installs the complete monitoring stack including CRDs (ServiceMonitor, PodMonitor)
- Pre-built dashboards for K8s cluster, nodes, and pods out of the box
- Learning Prometheus query language (PromQL) is directly transferable to production
- ServiceMonitor CRDs allow apps to self-register metrics scraping targets (Phase 2+)
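To illustrate the self-registration flow, a minimal ServiceMonitor might look like the following. The app name, namespace, and port name are hypothetical; the `release` label is what the chart's default ServiceMonitor selector typically matches on.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app                   # hypothetical app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Helm release name for discovery
spec:
  selector:
    matchLabels:
      app: sample-app                # matches the labels on the app's Service
  endpoints:
    - port: metrics                  # named port on the Service exposing /metrics
      interval: 30s
```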
### Negative
- kube-prometheus-stack is a large chart (50+ sub-components); significant RAM usage even when tuned down
- Grafana admin password sits in plain text in the values file; it must not be committed with real credentials to shared or production repos
  - Phase 3 will replace it with SOPS-encrypted secrets
- Alertmanager disabled — Phase 1 has no alerting
### Grafana Access

```bash
# NodePort access (no tunnel needed)
minikube service kube-prometheus-stack-grafana -n monitoring

# Or add to /etc/hosts and use ingress:
echo "127.0.0.1 grafana.local" | sudo tee -a /etc/hosts
# Open http://grafana.local (requires `sudo minikube tunnel`)
# Login: admin / admin
```
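If neither NodePort nor the tunnel is convenient, a plain port-forward also works. This assumes the chart's default Grafana service name and service port 80; adjust if the release name differs.

```shell
# Forward local port 3000 to the Grafana service inside the cluster
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Then browse to http://localhost:3000 (login: admin / admin)
```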
## Trade-offs
Learning completeness (full observability stack) is prioritised over minimal resource usage. The memory footprint (~1.5 GB for the full stack) is acceptable on a MacBook with 16+ GB RAM.