Skip to content

ADR-007: Use kube-prometheus-stack for Observability

Status

Accepted

Date

2026-03-07

Context

The sandbox needs an observability layer to monitor cluster health, resource usage, and application metrics. The solution must be CNCF-aligned, production-representative, and runnable on a local MacBook with constrained resources.

Decision

Use kube-prometheus-stack (Prometheus Community Helm chart), which bundles:

Component CNCF Status Purpose
Prometheus CNCF Graduated Metrics collection and storage
Grafana CNCF Foundation member Dashboards and visualisation
Alertmanager Part of Prometheus Alert routing (disabled Phase 1)
kube-state-metrics CNCF Sandbox K8s object metrics
node-exporter Prometheus ecosystem Node hardware metrics
prometheus-operator Bundled Manages Prometheus via CRDs

Resource tuning for local MacBook: - Prometheus retention: 24h / 2 GB - Grafana: 128 MB request, 256 MB limit - Alertmanager: disabled (Phase 1)

Alternatives Considered

Tool Reason Not Chosen
VictoriaMetrics Excellent performance and storage efficiency; but less ubiquitous in production; Prometheus is the CNCF standard to learn first
Thanos / Cortex Long-term Prometheus storage solutions; over-engineered for a single-node local cluster
OpenTelemetry Collector + Jaeger Covers tracing (Phase 3 candidate); does not replace metrics
Datadog / New Relic Commercial; not open source; not representative of self-hosted CNCF stack
Metrics Server only Provides only CPU/memory for HPA; no dashboards, no persistence, no alerting

Consequences

Positive

  • One Helm chart installs the complete monitoring stack including CRDs (ServiceMonitor, PodMonitor)
  • Pre-built dashboards for K8s cluster, nodes, and pods out of the box
  • Learning Prometheus query language (PromQL) is directly transferable to production
  • ServiceMonitor CRDs allow apps to self-register metrics scraping targets (Phase 2+)

Negative

  • kube-prometheus-stack is a large chart (~50+ sub-components); significant RAM usage even tuned down
  • Grafana admin password in plain text in values file — must not be committed with real credentials in shared/production repos
  • Phase 3 will replace with SOPS-encrypted secrets
  • Alertmanager disabled — Phase 1 has no alerting

Grafana Access

# NodePort access (no tunnel needed)
minikube service kube-prometheus-stack-grafana -n monitoring

# Or add to /etc/hosts and use ingress:
echo "127.0.0.1  grafana.local" | sudo tee -a /etc/hosts
# Open: http://grafana.local  (requires sudo minikube tunnel)
# Login: admin / admin

Trade-offs

Learning completeness (full observability stack) is prioritised over minimal resource usage. The memory footprint (~1.5 GB for the full stack) is acceptable on a MacBook with 16+ GB RAM.