How to Monitor Applications in Kubernetes?
Junior Level
Simple Definition
Monitoring in Kubernetes is the practice of collecting, storing, and analyzing data about the state of the cluster, containers, and applications. Monitoring answers three questions: “What is happening?”, “Why did it happen?” and “How to prevent it?”
Analogy
Monitoring is like a car dashboard. The speedometer shows speed (CPU usage), the temperature gauge — overheating (memory usage), the CHECK ENGINE light — something is broken (error rate). Without a dashboard, you’re driving blind.
Example: Installing Prometheus Stack (Helm)
# Add repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
Example: Spring Boot Actuator + Micrometer
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
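By default, kube-prometheus-stack does not scrape arbitrary Pods; Prometheus discovers targets through ServiceMonitor CRDs. A minimal sketch, assuming the app is exposed by a Service labeled `app: my-app` in the `default` namespace and the Helm release is named `monitoring` as above (all names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # hypothetical app name
  namespace: monitoring
  labels:
    release: monitoring         # must match the Helm release so Prometheus selects it
spec:
  selector:
    matchLabels:
      app: my-app               # labels on the application's Service
  namespaceSelector:
    matchNames:
      - default                 # namespace where the Service lives
  endpoints:
    - port: http                # named port on the Service
      path: /actuator/prometheus
      interval: 15s
```

After applying this, the target should appear on the Prometheus "Targets" page within one or two scrape intervals.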
kubectl Example
# Open Grafana dashboard
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# http://localhost:3000 (admin/prom-operator)
# Open Prometheus UI (the Operator creates a stable prometheus-operated service)
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# List monitoring Pods
kubectl get pods -n monitoring
# prometheus-monitoring-0
# grafana-xxxxx
# alertmanager-monitoring-0
When to Use
- Always in production — without monitoring you are blind
- For tracking CPU, memory, disk, network usage
- For error detection and performance degradation
- For alerting: SMS/email/Slack when problems occur
Middle Level
How it Works
Monitoring in Kubernetes is built on five components:
- Prometheus — the de facto standard for monitoring in K8s and a time-series database. Pull model: it scrapes metrics from Pods via an HTTP endpoint (/metrics) every 15-30 seconds
- Node Exporter — a DaemonSet on each node, collects OS metrics (CPU, RAM, disk, network)
- cAdvisor — built into the kubelet, collects container metrics (CPU, memory, network per container)
- kube-state-metrics — a Deployment, generates metrics about K8s objects (Pod status, Deployment replicas, PVC usage)
- Grafana — visualization, dashboards, alerting
Metrics collection chain:
App (/actuator/prometheus) ← Prometheus scrapes every 15s
Node Exporter (DaemonSet) ← Prometheus scrapes every 15s
cAdvisor (kubelet) ← Prometheus scrapes every 15s
kube-state-metrics ← Prometheus scrapes every 15s
↓
Prometheus TSDB (stores 15 days)
↓
Grafana (dashboards)
↓
Alertmanager (notifications)
Practical Scenarios
Scenario 1: Java application monitoring
# JVM memory usage
jvm_memory_used_bytes{area="heap"}
# GC pause time
rate(jvm_gc_pause_seconds_sum[5m])
# HTTP request rate
rate(http_server_requests_seconds_count[5m])
# HTTP error rate (5xx)
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
# Request latency p99
histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m]))
Scenario 2: Alert on 5xx error growth
# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: app.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 5% for 5 minutes"
Scenario 3: Logging via Loki
# Helm values for Loki
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
promtail:
  enabled: true
Promtail (DaemonSet) collects logs from each node -> sends to Loki -> Grafana visualizes.
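Once logs are in Loki, they are queried from Grafana with LogQL. A few illustrative queries (the `app` label is an assumption about how Promtail is configured):

```logql
# All logs from one application
{app="my-app"}

# Only lines containing ERROR
{app="my-app"} |= "ERROR"

# Error-line rate over 5 minutes (usable in dashboards and alerts)
rate({app="my-app"} |= "ERROR" [5m])
```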
Common Mistakes Table
| Mistake | Consequence | Solution |
|---|---|---|
| Only infrastructure metrics, no business metrics | CPU looks fine, but payments aren’t going through | Add business metrics: payments_per_second, order_processing_time |
| Prometheus stores metrics only 15 days | Can’t analyze long-term trends | Use Thanos/Cortex for long-term storage (S3) |
| Too many alerts (alert fatigue) | Team ignores alerts, misses critical ones | Reduce to 5-10 critical alerts, use alert grouping |
| No tracing (only metrics and logs) | Can see request is slow, but not where | Add OpenTelemetry + Jaeger/Tempo |
| Prometheus single point of failure | When Prometheus goes down — no metrics, no alerts | Use Prometheus HA (replica 2, Alertmanager cluster) |
| Metrics collected too often (every 5s) | High load on Prometheus storage and network | Set 15-30s for most metrics, 5s for critical ones |
Comparison: Observability Stack
| Component | Metrics | Logs | Traces |
|---|---|---|---|
| Tool | Prometheus | Loki / ELK | Jaeger / Tempo |
| What it shows | Numbers (CPU, latency, errors) | Events (log lines, stack traces) | Request path through services |
| Collection | Pull (scrape) | Push (agent -> storage) | Push (SDK -> collector) |
| Storage | TSDB (15 days) | Index + chunks (30 days) | Trace index + span store |
| Query language | PromQL | LogQL | Trace ID lookup |
| When to use | Trends, alerting, dashboards | Debug, audit, compliance | Distributed tracing, bottleneck detection |
Monitoring (Prometheus/Grafana) – metrics: CPU, RAM, latency, error rate. Logging (ELK/Loki) – logs: container stdout/stderr. Tracing (Jaeger/Zipkin) – distributed tracing: request path through services.
When NOT to Use
- Dev/local development — too heavy, use simple logs and health endpoints
- Very small clusters (1-2 Pods) — Prometheus overhead may exceed the benefit
- When no team to support it — monitoring requires maintenance: updating dashboards, configuring alerts, managing storage
- For business analytics — Prometheus is not a data warehouse. Use ClickHouse/BigQuery for business analytics
Senior Level
Deep Mechanics: Prometheus TSDB, Scraping, and Controller Reconciliation
Prometheus Architecture:
Prometheus works on a pull model — it polls targets via HTTP /metrics endpoint.
- Service Discovery: Prometheus discovers targets via the Kubernetes API:
  - `kubernetes_sd_configs` with roles `pod`, `service`, `endpoints`, `node`
  - Automatically finds Pods with the annotation `prometheus.io/scrape: "true"`
  - Updates the target list on Pod changes (via Kubernetes informers)
- Scraping: every `scrape_interval` (15s by default):
  - HTTP GET on the `/metrics` endpoint
  - Parse the text format (Prometheus exposition format)
  - Store in the TSDB (Time-Series Database)
- TSDB (Time-Series Database):
  - Data stored as blocks (2-hour chunks) on disk
  - Each block: index (series -> chunks), chunks (raw samples), meta.json
  - WAL (Write-Ahead Log) for durability on crash
  - Compaction: 2h blocks -> 4h -> 8h -> … (merges blocks to cut disk and query overhead)
  - Memory-mapped files for fast reads
- Prometheus Operator (CoreOS):
  - A Kubernetes Operator managing Prometheus via CRDs: `Prometheus`, `ServiceMonitor`, `PodMonitor`, `PrometheusRule`
  - `ServiceMonitor` — declarative scrape-target definition (instead of manual config)
  - Automatically generates the Prometheus config from CRDs
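For contrast, the annotation-based service discovery described above looks roughly like this in a hand-written prometheus.yml; the Operator generates equivalent config for you from ServiceMonitor/PodMonitor objects:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                 # discover every Pod via the Kubernetes API
    relabel_configs:
      # keep only Pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # allow a Pod to override the metrics path via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```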
kube-state-metrics Internals: kube-state-metrics connects to API Server via informers and generates metrics about K8s object states:
kube_pod_status_phase{namespace="default", pod="my-app", phase="Running"} -> 1
kube_deployment_status_replicas{namespace="default", deployment="my-app"} -> 3
kube_persistentvolumeclaim_status_phase{namespace="default", pvc="data", phase="Bound"} -> 1
Alertmanager: Prometheus sends firing alerts to Alertmanager. Alertmanager:
- Groups: Groups similar alerts (by label)
- Inhibition: Suppresses dependent alerts (if node down, don’t alert on every Pod on that node)
- Routing: Sends to the right receiver (Slack, PagerDuty, email)
- Deduplication: HA Prometheus (replica 2) sends identical alerts, Alertmanager deduplicates
OpenTelemetry (Tracing): OpenTelemetry SDK instruments the application, collects spans (individual operations) and groups them into traces (full request path). Spans are sent via OTLP protocol to collector -> Jaeger/Tempo for storage and visualization.
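The span pipeline above is typically wired through an OpenTelemetry Collector. A minimal configuration sketch (the Tempo endpoint is an assumption for your cluster):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}                    # apps send spans via OTLP/gRPC (port 4317)
      http: {}
processors:
  batch: {}                       # batch spans before export
exporters:
  otlp/tempo:
    endpoint: tempo.monitoring.svc:4317   # assumed Tempo service address
    tls:
      insecure: true              # tighten this in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```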
Trade-offs
| Aspect | Trade-off |
|---|---|
| Pull vs Push | Pull (Prometheus) = simpler service discovery, but doesn’t work for ephemeral targets. Push (StatsD) = works for batch jobs, but needs gateway |
| Prometheus vs VictoriaMetrics | Prometheus = standard, huge community. VictoriaMetrics = better performance, less RAM, but less mature ecosystem |
| Loki vs ELK | Loki = lighter, cheaper, better for K8s. ELK = more powerful full-text search, but heavier (Elasticsearch JVM) |
| Prometheus storage local vs remote | Local = simpler, but limited to 15-30 days. Remote (Thanos/Cortex) = long-term storage, but more complex |
| High cardinality labels | More labels = more precise queries, but exponentially more series -> more RAM/CPU |
| Scrape interval | Short (5s) = more precise, but higher load. Long (30s) = less load, but may miss spikes |
Prometheus stores data locally (usually 15-30 days). For long-term storage, use Thanos or Cortex.
Edge Cases (7+)
Edge Case 1: High Cardinality Explosion
# BAD: http_requests_total{path="/users/123", method="GET", status="200"}
# path contains ID -> unique series per user -> millions of series
Cardinality explosion: Prometheus stores each unique label combination as a separate time series. 1000 users x 10 endpoints x 5 statuses = 50,000 series. This eats RAM and slows queries. Solution: use path="/users/:id" (grouping), not specific IDs.
Edge Case 2: Prometheus OOM on large series count
Prometheus stores all active series in RAM. With 10 million series, Prometheus requires ~20-30GB RAM. If the namespace ResourceQuota limit is lower, Prometheus is OOMKilled. Solution: `sample_limit` / `label_limit` in the scrape config, `metric_relabel_configs` to drop high-cardinality labels, or VictoriaMetrics (smaller RAM footprint).
Edge Case 3: Scraping target disappears before scrape completes
Pod deleted during a scrape. Prometheus gets connection refused or a partial response, and metrics for that scrape are lost. With frequent deployments (100+ Pods/day), this creates gaps in metrics. Solution: keep `scrape_timeout` well below `scrape_interval` so failed scrapes fail fast, and give availability alerts a `for:` delay so routine rollouts don't page.
Edge Case 4: Alertmanager notification flooding
Node down -> 50 Pods on node not ready -> 50 alerts firing simultaneously. Alertmanager sends 50 Slack messages. Team gets alert fatigue. Solution: alert grouping (group_by: ['node']), inhibition rules (if node down, suppress pod alerts).
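The grouping and inhibition fix can be sketched in Alertmanager config (receiver and alert names are illustrative):

```yaml
route:
  receiver: slack-default
  group_by: ['alertname', 'node']   # one notification per node, not per Pod
  group_wait: 30s                   # wait briefly to batch related alerts
  group_interval: 5m
inhibit_rules:
  - source_matchers:
      - alertname = NodeDown        # while this fires...
    target_matchers:
      - alertname = PodNotReady     # ...suppress these
    equal: ['node']                 # only for Pods on the same node
```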
Edge Case 5: Tracing overhead
OpenTelemetry with sampling: 1.0 (100% of traces) adds 5-15% overhead to each request’s latency. At high load (10K RPS), this is significant degradation. Solution: probabilistic sampling (0.1-1%), or adaptive sampling (increase sampling rate for errors).
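Probabilistic sampling can usually be set without code changes through the standard OpenTelemetry SDK environment variables, e.g. on the app container (0.01 matches the ~1% suggested above):

```yaml
env:
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio   # respect parent decision, ratio-sample root spans
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.01"                     # sample 1% of root traces
```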
Edge Case 6: Loki label cardinality
Loki indexes only labels, not log line content. If pod_name is used as a label, each unique label combination = separate stream. With 1000 Pods = 1000 streams. Solution: use labels with lower cardinality (app, namespace), not pod_name.
Edge Case 7: Thanos/Cortex complexity
Thanos adds a sidecar (to Prometheus), query gateway, store gateway (S3), compactor, and ruler. That is 5+ additional components. For a team of 3 DevOps engineers, this may be overhead. Solution: start with Prometheus + 30-day retention, move to Thanos only when long-term storage is needed.
Edge Case 8: kube-state-metrics API Server load
kube-state-metrics watches all K8s objects via API Server informers. With 5000 Pods, 1000 Services, 500 Deployments, informer cache takes ~500MB RAM. List/Watch operations add load to API Server. Solution: --resources flag to limit watching only needed resources.
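The `--resources` restriction is a one-line change in the kube-state-metrics Deployment (image tag and resource list are illustrative):

```yaml
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0  # example tag
    args:
      - --resources=pods,deployments,persistentvolumeclaims  # watch only what you chart
```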
Performance Numbers
| Metric | Value |
|---|---|
| Prometheus scrape latency | 5-50ms per target (depends on metric count) |
| Prometheus RAM per 1M series | ~2-3GB |
| Prometheus disk per 1M series/day | ~5-10GB (after compaction) |
| Max series per Prometheus instance | ~10-20 million (depends on RAM) |
| Query latency (simple) | 10-100ms |
| Query latency (complex, 7d range) | 1-10 seconds |
| Alertmanager notification latency | 1-5 seconds (from firing to notification) |
| Loki ingestion latency | 1-3 seconds (from log write to queryable) |
| OpenTelemetry overhead (0.1% sampling) | <0.1% latency increase |
| OpenTelemetry overhead (100% sampling) | 5-15% latency increase |
| kube-state-metrics RAM (5000 Pods) | ~500MB-1GB |
Security
- Prometheus endpoint must not be public — `/metrics` exposes internal application structure. Restrict via NetworkPolicy
- mTLS for scraping — if using Istio, Prometheus scrape traffic must be excluded from mTLS or use sidecar injection
- RBAC for the Prometheus SA — the Prometheus ServiceAccount requires `get`, `list`, `watch` on Pods, Services, Endpoints. Restrict to only the needed namespaces
- Alertmanager webhook authentication — Slack/PagerDuty webhooks should use authentication tokens, not plaintext URLs
- Loki log sanitization — logs may contain sensitive data (PII, credentials). Use log redaction (Promtail pipeline stages) before sending to Loki
- Thanos/S3 encryption — long-term metric storage in S3 should be encrypted (SSE-S3 or SSE-KMS)
- OpenTelemetry collector authentication — the OTLP endpoint should require authentication (API key, mTLS); otherwise anyone can send fake spans
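The NetworkPolicy restriction for the metrics endpoint can be sketched like this, assuming the app listens on port 8080 and Prometheus runs in the monitoring namespace (labels and names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-from-monitoring
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: my-app                 # the application's Pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # auto-set namespace label
      ports:
        - port: 8080              # the metrics/HTTP port
          protocol: TCP
```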
Production War Story
Situation: SaaS platform, 1000-Pod cluster, Prometheus + Grafana + Alertmanager. Application: Java/Spring Boot microservices with OpenTelemetry tracing.
Incident:
- A developer added the metric `http_requests_total{path="/users/{id}", user_id="<actual-id>", ...}` — `user_id` carried the actual user ID, not a template
- In 2 hours, cardinality grew from 500K to 15 million series (100K unique user_id values)
- Prometheus RAM usage grew from 8GB to 25GB
- Prometheus was OOMKilled (ResourceQuota limit: 20GB)
- Alertmanager stopped receiving alerts — the team didn't learn about the problem
- After 30 minutes, the on-duty engineer noticed the Grafana dashboards were empty
- Prometheus restarted, but on startup it replayed the WAL (Write-Ahead Log) -> OOM again -> crash loop
- Monitoring was down for 4 hours until the high-cardinality metric was removed and RAM was increased
Post-mortem and fix:
- Cardinality guard — `sample_limit` per scrape target in the Prometheus config + an alert on series growth rate
- Metric naming convention — `path="/users/:id"` (template), not specific IDs. Code review for new metrics
- Prometheus HA — 2 Prometheus replicas with Alertmanager deduplication
- ResourceQuota for the Prometheus namespace — a separate namespace with guaranteed resources
- Alert on Prometheus health — an external health check (synthetic monitoring), not dependent on Prometheus itself
- Thanos for long-term storage — the sidecar uploads blocks to S3, so history survives even if Prometheus crashes
Monitoring after fix:
# Alert: Prometheus series growth rate (head series is a gauge, so use delta, not rate)
delta(prometheus_tsdb_head_series[1h]) > 100000  # >100K new series/hour
# Alert: Prometheus memory usage
process_resident_memory_bytes{job="prometheus"} / 20e9 > 0.8 # >80% of 20GB
# Alert: Prometheus down (external check)
up{job="prometheus"} == 0
# Alert: Alertmanager not receiving alerts
rate(alertmanager_notifications_total{status="success"}[5m]) == 0
Monitoring (Prometheus/Grafana)
Key metrics for monitoring the monitoring:
# Prometheus health
up{job="prometheus"}
# Series count (cardinality)
prometheus_tsdb_head_series
# Scrape duration per target
scrape_duration_seconds
# Actual interval between scrapes (should hover around scrape_interval)
rate(prometheus_target_interval_length_seconds_sum[5m])
/ rate(prometheus_target_interval_length_seconds_count[5m])
# TSDB compaction
rate(prometheus_tsdb_compactions_total[1h])
# Alertmanager alerts
alertmanager_alerts{state="firing"}
# kube-state-metrics latency
kube_state_metrics_list_duration_seconds
Key metrics for application (Golden Signals):
# 1. Latency (p50, p95, p99)
histogram_quantile(0.50, rate(http_server_requests_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m]))
# 2. Traffic (RPS)
sum(rate(http_server_requests_seconds_count[5m])) by (service)
# 3. Errors (5xx rate)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count[5m]))
# 4. Saturation (CPU, memory, disk)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
container_memory_working_set_bytes / container_spec_memory_limit_bytes
Grafana Dashboard panels:
- Golden Signals Overview: Latency p50/p95/p99, Traffic RPS, Error Rate, Saturation (CPU/Memory)
- JVM Metrics: Heap/Non-Heap memory, GC pause time, thread count, class loading
- Kubernetes Overview: Pod status, Deployment replicas, PVC usage, Node resources
- Alerting Overview: Firing alerts by severity, alert rate, alertmanager notification latency
- Tracing Overview (Tempo/Jaeger): Trace count, error trace rate, slowest endpoints
- Prometheus Self-Monitoring: Series count, scrape duration, TSDB size, memory usage
Highload Best Practices
- Golden Signals: Latency, Traffic, Errors, Saturation — always monitor these 4 metrics
- Cardinality management — don’t use high-cardinality labels (user_id, request_id). Use templates: `path="/users/:id"`
- Prometheus HA — 2 Prometheus replicas + Alertmanager with deduplication
- Scrape interval: 15s for most, 5s for critical — balance between accuracy and load
- Retention: 15 days local + Thanos/Cortex for long-term — S3 storage for compliance and trend analysis
- Alert routing by severity:
- Critical -> PagerDuty (immediately)
- Warning -> Slack (during work hours)
- Info -> Email (daily digest)
- Inhibition rules — if node down, don’t alert on every Pod on that node
- OpenTelemetry sampling: 0.1-1% for production, 100% for errors
- Loki label cardinality — use `app`, `namespace`, not `pod_name`
- Monitor the monitoring — external health check for Prometheus/Alertmanager, not dependent on them
- Prometheus ResourceQuota — separate namespace with guaranteed 20-30GB RAM for 10M series
- Dashboard as code — Grafana dashboards in Git (JSON), deployed via CI/CD
- SLO/SLI tracking — define Service Level Objectives (99.9% availability, p99 < 500ms) and track error budget
- Regular alert review — monthly review of firing alerts, remove noisy alerts, add missing ones
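The SLO/SLI bullet can be made concrete with PromQL, reusing the Spring Boot metric names from earlier. A simplified single-window burn-rate sketch for a 99.9% availability objective (production setups usually use the multi-window variant from the Google SRE workbook):

```promql
# SLI: share of non-5xx requests over 30 days
sum(rate(http_server_requests_seconds_count{status!~"5.."}[30d]))
/ sum(rate(http_server_requests_seconds_count[30d]))

# Fast-burn alert: the 0.1% error budget is burning 14x too fast
(
  sum(rate(http_server_requests_seconds_count{status=~"5.."}[1h]))
  / sum(rate(http_server_requests_seconds_count[1h]))
) > 14 * 0.001
```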
Interview Cheat Sheet
Must know:
- Observability = Metrics (Prometheus) + Logs (Loki/ELK) + Traces (Jaeger/Tempo)
- Prometheus — pull model, time-series DB; scrapes `/metrics` every 15-30 seconds
- Golden Signals: Latency, Traffic, Errors, Saturation — always monitor
- kube-prometheus-stack = Prometheus + Grafana + Alertmanager + Node Exporter + kube-state-metrics
- Cardinality explosion — main Prometheus problem (high-cardinality labels = OOM)
- For Java: JVM memory, GC pause, HTTP error rate, request latency (histogram_quantile)
- Alert routing by severity: Critical -> PagerDuty, Warning -> Slack, Info -> digest
Common follow-up questions:
- “Why is cardinality explosion dangerous?” — Each unique label combination = series; millions of series -> OOM
- “Prometheus HA — why?” — Single point of failure; 2 replicas + Alertmanager deduplication
- “Pull vs Push?” — Pull (Prometheus) = simpler service discovery; Push (StatsD) = for batch jobs
- “How to monitor the monitoring?” — External health check for Prometheus/Alertmanager, not dependent on them
Red flags (DO NOT say):
- “Prometheus stores metrics forever” (locally 15-30 days; Thanos/Cortex for long-term)
- “I only monitor CPU/RAM” (need business metrics: error rate, latency, throughput)
- “100% sampling for tracing in production” (5-15% overhead; use 0.1-1%)
- “Alert fatigue is normal” (reduce to 5-10 critical; otherwise team ignores)
Related topics:
- [[Why are health checks needed]] — health endpoints for monitoring
- [[How does scaling work in Kubernetes]] — custom metrics for HPA
- [[What is Kubernetes and why is it needed]] — Control Plane monitoring