Question 21 · Section 17

How to monitor a distributed microservices system

🟢 Junior Level

Monitoring is observing the health of all services in real time.

Four key signals (Golden Signals):

  1. Latency — how long a request takes
  2. Traffic — how many requests per second
  3. Errors — how many requests fail (rate or share of failed requests)
  4. Saturation — how full resources are (CPU, memory)

Tools:

  • Prometheus — metrics collection
  • Grafana — visualization
  • Alertmanager — alerts

🟡 Middle Level

Metrics

Application metrics:

- Request rate (req/s)
- Error rate (%)
- Latency (p50, p95, p99)
- Active connections
- Thread pool usage
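The latency percentiles in the list above can be illustrated with the nearest-rank method. This is a minimal sketch that assumes raw latency samples are held in memory; the class name `LatencyPercentiles` is hypothetical, and real setups (Micrometer, Prometheus) usually estimate percentiles from histograms instead of sorting raw samples.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Micrometer or Prometheus.
final class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // 100 requests: 98 fast ones (20 ms) and 2 slow ones (3000 ms).
        long[] samples = new long[100];
        Arrays.fill(samples, 20);
        samples[98] = 3000;
        samples[99] = 3000;

        System.out.println("avg = " + average(samples));
        System.out.println("p50 = " + percentile(samples, 50));
        System.out.println("p95 = " + percentile(samples, 95));
        System.out.println("p99 = " + percentile(samples, 99));
    }
}
```

With this sample set the average is 79.6 ms and p50 and p95 are both 20 ms, while p99 is 3000 ms: only the high percentile exposes the two slow requests.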

Infrastructure metrics:

- CPU usage
- Memory usage
- Disk I/O
- Network I/O

Prometheus + Grafana

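// Micrometer for Spring Boot
// MeterRegistry is Micrometer's registry interface, auto-configured by Spring Boot Actuator.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    private final OrderService orderService;
    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderController(OrderService orderService, MeterRegistry registry) {
        this.orderService = orderService;
        this.orderCounter = registry.counter("orders.created");
        this.orderTimer = registry.timer("orders.processing.time");
    }

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest req) {
        // record() measures the lambda's execution time and writes it to the Timer.
        return orderTimer.record(() -> {
            Order order = orderService.create(req);
            // Count only after create() succeeds, so failed attempts are not counted as created.
            orderCounter.increment();
            return order;
        });
    }
}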

Common mistakes

  1. Too many alerts:
    100 alerts per hour → alert fatigue → real incidents get missed
    Solution: keep only actionable alerts
    // Actionable alert = when it fires, the engineer knows WHAT to do.
    // If an alert fires and there is nothing to do, it is noise: remove it or improve it.

🔴 Senior Level

RED method

For services:
- Rate: requests per second
- Errors: failed requests per second
- Duration: request latency distribution
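The three RED values can be derived from a window of observed requests. The sketch below is an illustration under stated assumptions: the `RedMetrics`, `Request`, and `Summary` names are hypothetical, a request counts as failed when its status is 5xx, and duration is reduced to mean and max rather than a full distribution. It uses records, so it needs Java 16 or newer.

```java
import java.util.List;

// Hypothetical types for illustration only.
final class RedMetrics {

    // One observed request: HTTP status code and duration in milliseconds.
    record Request(int status, long durationMs) {}

    // RED summary for one observation window.
    record Summary(double ratePerSec, double errorsPerSec,
                   double meanDurationMs, long maxDurationMs) {}

    static Summary summarize(List<Request> window, long windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        // Errors: assumption for this sketch, a 5xx status means the request failed.
        long failed = window.stream().filter(r -> r.status() >= 500).count();
        double mean = window.stream().mapToLong(Request::durationMs).average().orElse(0.0);
        long max = window.stream().mapToLong(Request::durationMs).max().orElse(0L);
        return new Summary(
                (double) window.size() / windowSeconds, // Rate
                (double) failed / windowSeconds,        // Errors
                mean,                                   // Duration (mean)
                max);                                   // Duration (max)
    }

    public static void main(String[] args) {
        List<Request> window = List.of(
                new Request(200, 40),
                new Request(200, 60),
                new Request(500, 900),
                new Request(201, 80));
        // Four requests observed over a 2-second window.
        System.out.println(summarize(window, 2));
    }
}
```

In a real service these numbers would normally come from a metrics library and be aggregated by the monitoring backend, not computed per window in application code.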

USE method

For resources:
- Utilization: percentage of time the resource was busy
- Saturation: amount of work queued because the resource cannot serve it yet
- Errors: error count
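As an illustration of USE applied to one concrete resource, the sketch below treats a worker thread pool as the resource. The `ThreadPoolUse` class and its helper methods are hypothetical; it assumes a fixed-size `ThreadPoolExecutor` with a bounded queue, and counts rejected tasks as the error signal.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; the resource here is a worker thread pool.
final class ThreadPoolUse {

    // Builds a fixed-size pool with a bounded queue; rejected tasks are counted and dropped.
    static ThreadPoolExecutor newPool(int threads, int queueCapacity, AtomicLong rejected) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                (task, executor) -> rejected.incrementAndGet());
    }

    // Utilization: share of worker threads busy right now (an instantaneous sample;
    // averaging such samples over time approximates "time the resource was busy").
    static double utilization(ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }

    // Saturation: tasks that are waiting because no worker is free.
    static int saturation(ThreadPoolExecutor pool) {
        return pool.getQueue().size();
    }

    // A task that blocks until the latch is released, to simulate slow work.
    static Runnable blockingTask(CountDownLatch release) {
        return () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        AtomicLong rejected = new AtomicLong();      // Errors: rejected tasks
        ThreadPoolExecutor pool = newPool(2, 3, rejected);
        CountDownLatch release = new CountDownLatch(1);

        for (int i = 0; i < 6; i++) {
            pool.execute(blockingTask(release));     // 2 run, 3 queue, 1 is rejected
        }
        while (pool.getActiveCount() < 2) {
            Thread.onSpinWait();                     // wait until both workers are busy
        }
        System.out.println("utilization = " + utilization(pool));
        System.out.println("saturation  = " + saturation(pool));
        System.out.println("errors      = " + rejected.get());

        release.countDown();                         // let the blocked tasks finish
        pool.shutdown();
    }
}
```

The point of the split is that a pool can sit at full utilization without being a problem; a growing queue (saturation) or rejected tasks (errors) is what signals that it is undersized for the load.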

Production Experience

Prometheus config:

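scrape_configs:
  - job_name: 'user-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      # Placeholder address: replace with the real host:port or use service discovery.
      - targets: ['user-service:8080']

  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      # Placeholder address: replace with the real host:port or use service discovery.
      - targets: ['order-service:8080']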

Alerting rules:

groups:
- name: service_alerts
  rules:
  - alert: HighErrorRate
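    # This is an error ratio: > 0.05 means more than 5% of requests fail, not more than 0.05 requests/sec.
    # sum by (job) aggregates across label sets; without it, the division matches each 5xx series
    # only against itself and always yields 1.
    # With Micrometer, the request counter is typically exposed as http_server_requests_seconds_count.
    expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05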
    for: 5m
    labels:
      severity: critical
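The rule above combines two ideas: the threshold applies to a ratio, and `for: 5m` means the condition must hold continuously for five minutes before the alert moves from pending to firing. The following sketch simulates that behaviour in plain Java. It is not Prometheus code; the `ErrorRateAlert` class is hypothetical and only mirrors the inactive, pending, and firing states.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical simulation of an alert with a "for" clause; not Prometheus code.
final class ErrorRateAlert {

    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;   // e.g. 0.05 for 5%
    private final Duration holdFor;   // e.g. 5 minutes, like "for: 5m"
    private Instant activeSince;      // null while the condition is false

    ErrorRateAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    // Error ratio = failed requests / all requests over the same window.
    static double errorRatio(double errorsPerSec, double requestsPerSec) {
        return requestsPerSec == 0 ? 0.0 : errorsPerSec / requestsPerSec;
    }

    // Called once per evaluation interval with the current rates.
    State evaluate(Instant now, double errorsPerSec, double requestsPerSec) {
        if (errorRatio(errorsPerSec, requestsPerSec) <= threshold) {
            activeSince = null;       // condition broke: the timer resets
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;        // condition just became true
        }
        Duration held = Duration.between(activeSince, now);
        return held.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ErrorRateAlert alert = new ErrorRateAlert(0.05, Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");

        // 10 errors/s out of 100 req/s is a 10% ratio: above the threshold.
        System.out.println(alert.evaluate(t0, 10, 100));
        System.out.println(alert.evaluate(t0.plus(Duration.ofMinutes(4)), 10, 100));
        System.out.println(alert.evaluate(t0.plus(Duration.ofMinutes(5)), 10, 100));
        // 1 error/s out of 100 req/s is 1%: the alert clears.
        System.out.println(alert.evaluate(t0.plus(Duration.ofMinutes(6)), 1, 100));
    }
}
```

The hold period trades detection speed for fewer false positives: a short error spike never leaves the pending state, which keeps the alert actionable.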

Best Practices

✅ Golden Signals
✅ RED/USE methods
✅ Only actionable alerts
✅ Dashboard per service
✅ SLI/SLO tracking

❌ Too many alerts
❌ No dashboards
❌ No SLOs

🎯 Interview Cheat Sheet

Must know:

  • Golden Signals: Latency, Traffic, Errors, Saturation
  • Prometheus — metrics collection, Grafana — visualization, Alertmanager — alerts
  • Micrometer + Spring Boot Actuator — standard for Java applications
  • RED method (Rate, Errors, Duration) — for services
  • USE method (Utilization, Saturation, Errors) — for resources
  • Only actionable alerts — if triggered, the engineer knows exactly what action to take
  • p50, p95, p99 latency — p99 shows tail latency: what the slowest 1% of requests experience

Frequent follow-up questions:

  • What is an actionable alert? An alert for which the engineer knows the specific action to take. An alert with no action is noise: remove or improve it.
  • RED vs USE? RED — for services (Rate, Errors, Duration), USE — for resources (Utilization, Saturation, Errors).
  • Why p99 instead of average? The average hides outliers; p99 exposes the latency of the slowest 1% of requests.
  • What is alert fatigue? When too many alerts fire (e.g. 100 per hour), engineers start ignoring them and real incidents are missed. The fix: keep only actionable alerts.

Red flags (NOT to say):

  • “More alerts = better” — no, alert fatigue leads to missing real issues
  • “Average latency is enough” — no, it hides outliers, p99 is important
  • “Metrics = logs” — no, metrics = aggregated numbers, logs = detail
  • “Prometheus without Grafana is fine” — without dashboards, you can’t respond quickly

Related topics:

  • [[22. What is distributed tracing]]
  • [[5. What is Circuit Breaker pattern]]
  • [[17. How to ensure fault tolerance of microservices]]
  • [[26. What tools are used for microservices orchestration]]