How to monitor a distributed microservices system
🟢 Junior Level
Monitoring is observing the health of all services in real time.
Four key signals (Golden Signals):
- Latency — how long a request takes
- Traffic — how many requests per second
- Errors — how many errors per second
- Saturation — how full resources are (CPU, memory)
Tools:
- Prometheus — metrics collection
- Grafana — visualization
- Alertmanager — alerts
🟡 Middle Level
Metrics
Application metrics:
- Request rate (req/s)
- Error rate (%)
- Latency (p50, p95, p99)
- Active connections
- Thread pool usage
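The p50, p95, and p99 latencies listed above are percentiles of the request durations. The following is a minimal sketch of how they differ from the average, assuming the nearest-rank definition of a percentile (metrics backends typically estimate percentiles from histogram buckets instead). The class name `LatencyPercentiles` and its methods are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical name): latency percentiles via the nearest-rank method. */
final class LatencyPercentiles {

    /** Returns the p-th percentile (0 < p <= 100) of the samples, in the samples' unit. */
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        double[] sorted = samplesMs.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        // Nearest rank: the smallest value with at least p% of the samples at or below it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Arithmetic mean of the samples, for comparison with the percentiles. */
    static double mean(double[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }
}
```

With 98 requests at 20 ms and 2 requests at 1000 ms, `mean` returns about 39.6 ms while `percentile(samples, 50)` and `percentile(samples, 95)` both return 20 ms and `percentile(samples, 99)` returns 1000 ms. The average describes no real request, whereas p99 exposes the slow tail.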
Infrastructure metrics:
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
Prometheus + Grafana
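// Micrometer for Spring Boot
// OrderService, Order and OrderRequest are application classes that are not shown here.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    private final OrderService orderService;
    private final Counter orderCounter;
    private final Timer orderTimer;

    // MeterRegistry is the Micrometer interface (provided as a bean by Spring Boot Actuator).
    public OrderController(OrderService orderService, MeterRegistry registry) {
        this.orderService = orderService;
        this.orderCounter = registry.counter("orders.created");
        this.orderTimer = registry.timer("orders.processing.time");
    }

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest req) {
        // record() measures the lambda's execution time and writes it to the Timer.
        return orderTimer.record(() -> {
            Order order = orderService.create(req);
            orderCounter.increment(); // count only orders that were actually created
            return order;
        });
    }
}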
Common mistakes
- Too many alerts:
100 alerts per hour → alert fatigue → real ones are missed.
Solution: only actionable alerts.
An actionable alert is one where the engineer knows WHAT to do when it fires. If an alert fires and there is nothing to do, it is noise: remove it or improve it.
🔴 Senior Level
RED method
For services:
- Rate: requests per second
- Errors: failed requests per second
- Duration: request latency distribution
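The three RED figures can be derived from the same stream of request observations. The following is a minimal sketch under stated assumptions: the `RedMetrics` class, its nested types, and the bucket bounds are hypothetical, a request counts as failed when its HTTP status is 5xx, and the duration distribution is represented as simple histogram bucket counts.

```java
import java.util.List;

/** Illustrative sketch (hypothetical names): RED figures for one service over a time window. */
final class RedMetrics {

    /** One observed request. */
    static final class Request {
        final int status;
        final double durationMs;

        Request(int status, double durationMs) {
            this.status = status;
            this.durationMs = durationMs;
        }
    }

    /** Rate, Errors and Duration for the window. */
    static final class Summary {
        final double ratePerSec;      // Rate: requests per second
        final double errorsPerSec;    // Errors: failed requests per second
        final long[] durationBuckets; // Duration: request count per latency bucket

        Summary(double ratePerSec, double errorsPerSec, long[] durationBuckets) {
            this.ratePerSec = ratePerSec;
            this.errorsPerSec = errorsPerSec;
            this.durationBuckets = durationBuckets;
        }
    }

    /** Upper bounds (ms) of the buckets; one extra bucket catches everything slower. */
    static final double[] BUCKET_BOUNDS_MS = {100, 500, 1000};

    static Summary summarize(List<Request> window, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        long errors = 0;
        long[] buckets = new long[BUCKET_BOUNDS_MS.length + 1];
        for (Request r : window) {
            if (r.status >= 500) {
                errors++; // assumption: only 5xx responses count as failures
            }
            int i = 0;
            while (i < BUCKET_BOUNDS_MS.length && r.durationMs > BUCKET_BOUNDS_MS[i]) {
                i++;
            }
            buckets[i]++;
        }
        return new Summary(window.size() / windowSeconds, errors / windowSeconds, buckets);
    }
}
```

Keeping Duration as a distribution rather than a single number is a deliberate choice here: bucket counts can be aggregated across instances, and percentiles can be estimated from them later.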
USE method
For resources:
- Utilization: share of time the resource was busy, averaged over an interval
- Saturation: extra work the resource cannot service yet, usually queued
- Errors: error count
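The USE figures for a single resource can be computed from periodic samples. The following is a minimal sketch, assuming the resource is observed at a fixed interval. The `UseMetrics` class and its `Sample` type are hypothetical, and saturation is approximated here as the average queue length.

```java
/** Illustrative sketch (hypothetical names): USE figures from periodic samples of one resource. */
final class UseMetrics {

    /** One periodic observation of the resource, e.g. a worker pool sampled every second. */
    static final class Sample {
        final boolean busy; // was the resource doing work at sampling time?
        final int queued;   // work items waiting because the resource was occupied
        final int errors;   // errors reported since the previous sample

        Sample(boolean busy, int queued, int errors) {
            this.busy = busy;
            this.queued = queued;
            this.errors = errors;
        }
    }

    /** Utilization: fraction of samples in which the resource was busy (0.0 to 1.0). */
    static double utilization(Sample[] samples) {
        requireSamples(samples);
        int busy = 0;
        for (Sample s : samples) {
            if (s.busy) {
                busy++;
            }
        }
        return (double) busy / samples.length;
    }

    /** Saturation: average number of queued work items per sample. */
    static double saturation(Sample[] samples) {
        requireSamples(samples);
        long queued = 0;
        for (Sample s : samples) {
            queued += s.queued;
        }
        return (double) queued / samples.length;
    }

    /** Errors: total error count over all samples. */
    static long errors(Sample[] samples) {
        long total = 0;
        for (Sample s : samples) {
            total += s.errors;
        }
        return total;
    }

    private static void requireSamples(Sample[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
    }
}
```

In a JVM service, the inputs for a thread pool could plausibly come from `ThreadPoolExecutor.getActiveCount()` and `getQueue().size()`. Wiring that up is left out of the sketch.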
Production Experience
Prometheus config:
scrape_configs:
  # Each job also needs its targets (static_configs or service discovery), omitted here.
  - job_name: 'user-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
Alerting rules:
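groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Aggregate with sum() before dividing: the two sides carry different status labels,
        # so dividing the raw series would only match each 5xx series with itself.
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        # This is a >5% error ratio, not >0.05 requests/sec.
        for: 5m
        labels:
          severity: critical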
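The rule above combines two ideas: a ratio of failed to total requests compared against 0.05, and a `for` clause that requires the condition to hold for 5 minutes before the alert fires. The following is a simplified model of that logic, not Prometheus's actual implementation. The `ErrorRateAlert` class is hypothetical, and it assumes one rule evaluation per minute so that five consecutive breaches stand in for `for: 5m`.

```java
/** Simplified model (hypothetical names) of a ratio threshold with a hold period. */
final class ErrorRateAlert {

    static final double THRESHOLD = 0.05;  // 5% of requests, not 0.05 requests/sec
    static final int FOR_EVALUATIONS = 5;  // assumption: one evaluation per minute, so 5 ~ "for: 5m"

    /** Ratio of failed to total request rate; zero traffic is treated as ratio 0 (a simplification). */
    static double errorRatio(double errorsPerSec, double requestsPerSec) {
        return requestsPerSec == 0 ? 0.0 : errorsPerSec / requestsPerSec;
    }

    /** True only if the ratio exceeded the threshold in each of the last FOR_EVALUATIONS evaluations. */
    static boolean fires(double[] ratiosOldestFirst) {
        if (ratiosOldestFirst.length < FOR_EVALUATIONS) {
            return false; // condition has not been observed long enough yet
        }
        for (int i = ratiosOldestFirst.length - FOR_EVALUATIONS; i < ratiosOldestFirst.length; i++) {
            if (ratiosOldestFirst[i] <= THRESHOLD) {
                return false; // one healthy evaluation resets the hold period
            }
        }
        return true;
    }
}
```

For example, 3 failed requests per second out of 50 gives a ratio of 0.06, which breaches the threshold, but a single such evaluation does not fire the alert. The hold period is what keeps a short spike from paging anyone.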
Best Practices
✅ Golden Signals
✅ RED/USE methods
✅ Only actionable alerts
✅ Dashboard per service
✅ SLI/SLO tracking
❌ Too many alerts
❌ No dashboards
❌ No SLOs
🎯 Interview Cheat Sheet
Must know:
- Golden Signals: Latency, Traffic, Errors, Saturation
- Prometheus — metrics collection, Grafana — visualization, Alertmanager — alerts
- Micrometer + Spring Boot Actuator — standard for Java applications
- RED method (Rate, Errors, Duration) — for services
- USE method (Utilization, Saturation, Errors) — for resources
- Only actionable alerts — if triggered, the engineer knows exactly what action to take
- p50, p95, p99 latency: p99 captures tail latency, the experience of the slowest 1% of requests (not the absolute worst case)
Frequent follow-up questions:
- What is an actionable alert? An alert for which the engineer knows the specific action to take. No action = noise, remove it.
- RED vs USE? RED — for services (Rate, Errors, Duration), USE — for resources (Utilization, Saturation, Errors).
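- Why p99 instead of average? The average hides outliers; p99 is the latency that the slowest 1% of requests exceed, so it exposes the tail.
- What is alert fatigue? With something like 100 alerts per hour, engineers start ignoring alerts and miss the real ones. The fix is to keep only actionable alerts.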
Red flags (NOT to say):
- “More alerts = better” — no, alert fatigue leads to missing real issues
- “Average latency is enough” — no, it hides outliers, p99 is important
- “Metrics = logs” — no, metrics = aggregated numbers, logs = detail
- “Prometheus without Grafana is fine” — without dashboards, you can’t respond quickly
Related topics:
- [[22. What is distributed tracing]]
- [[21. How to monitor a distributed microservices system]]
- [[5. What is Circuit Breaker pattern]]
- [[17. How to ensure fault tolerance of microservices]]
- [[26. What tools are used for microservices orchestration]]