Question 26 · Section 15

How to monitor consumer lag in Kafka

Junior Level

Simple Definition

Consumer lag is the difference between the latest message in a Kafka partition and the last message the consumer processed. It shows how far behind the consumer is.

Analogy

Imagine a checkout line at a store. The cashier (consumer) has served 80 customers, and there are still 200 waiting in line. Lag = 200: the number of customers still waiting. If the lag keeps growing, the cashier can't keep up and you need to open another register.

Visualization

Partition 0:
  Log End Offset (latest in Kafka):  1000
  Committed Offset (processed):       800
  ────────────────────────────────────────
  Consumer Lag:                        200

  [0......100......200......800|←200→|1000]
  |----- processed -----| lag |

How to Check Lag

# Kafka CLI utility
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service

GROUP           TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service   orders   0          800             1000            200
order-service   orders   1          750             900             150
order-service   orders   2          900             950              50

Total lag: 400 messages

What Lag Means

| Lag | What It Means | What to Do |
| --- | ------------- | ---------- |
| 0 | Consumer processed everything | Nothing, all good |
| 1–100 | Normal backlog | Monitor |
| 100–1000 | Consumer is slowing down | Check throughput |
| 1000+ | Consumer can't keep up | Scale up |
| Growing fast | Critical problem | Alert + investigate |

(These thresholds depend on throughput: lag must be evaluated together with the producer rate. A lag of 1000 at 100 msg/s means the consumer is 10 seconds behind; at 1M msg/s it is negligible.)

When This Matters

  • Production monitoring — always monitor lag
  • Scaling decisions — when to add consumers
  • Health checks — lag is a health indicator
  • SLA monitoring — guarantee processing time

Middle Level

How Lag Calculation Works

Lag = LogEndOffset - CommittedOffset

LogEndOffset:
  - The last offset written to the partition
  - Updated on every producer send
  - Stored on the broker

CommittedOffset:
  - The last offset committed by the consumer
  - Stored in __consumer_offsets topic
  - Updated on consumer.commitSync/Async

Lag calculation:
  - Computed by external tools (not Kafka itself)
  - Kafka broker provides LogEndOffset
  - Monitoring tool reads committed offsets
  - Lag = the difference
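In code, this is just a per-partition subtraction. A minimal Python sketch of what a monitoring tool computes, with offsets hardcoded from the earlier `kafka-consumer-groups.sh` example (a real tool fetches them via the Kafka AdminClient):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: LogEndOffset - CommittedOffset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Offsets from the earlier CLI output
log_end = {0: 1000, 1: 900, 2: 950}
committed = {0: 800, 1: 750, 2: 900}

lag = consumer_lag(log_end, committed)   # {0: 200, 1: 150, 2: 50}
total_lag = sum(lag.values())            # 400
```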

Monitoring Tools

1. Kafka CLI

# All consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Details of a specific group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service

# Reset offsets (careful!)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-service --reset-offsets --to-latest --execute

2. Burrow (LinkedIn)

# Burrow HTTP API
GET http://burrow-server:8000/v3/kafka/cluster/consumer/order-service/status

Response:
{
  "error": false,
  "message": "consumer status returned",
  "status": {
    "cluster": "production",
    "group": "order-service",
    "status": "OK",
    "complete": 1.0,
    "partitions": [
      {
        "topic": "orders",
        "partition": 0,
        "status": "OK",
        "start": {"offset": 700, "timestamp": 1712000000000},
        "end": {"offset": 1000, "timestamp": 1712000100000},
        "current_lag": 200,
        "max_lag": 300
      }
    ],
    "partition_count": 3
  }
}

Burrow status values:

| Status | Meaning |
| ------ | ------- |
| OK | Lag is decreasing or 0 |
| WARN | Lag is stable but high |
| ERROR | Lag is growing |
| STOP | Consumer is not committing |
| STALL | Lag is not changing |
| REWIND | Offset moved backwards |

3. Prometheus + kafka_exporter

# docker-compose.yml
services:
  kafka-exporter:
    image: danielqsj/kafka-exporter
    command:
      - '--kafka.server=kafka:9092'
    ports:
      - "9308:9308"

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka-exporter:9308']

4. Grafana Dashboard

Key panels:

  • Consumer lag per partition (time series)
  • Total lag across partitions (single stat)
  • Lag rate (derivative — lag growing speed)
  • Consumer throughput (messages/sec)
  • Producer vs Consumer throughput comparison
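The lag-rate panel is the discrete analogue of Prometheus's `deriv()`. A minimal sketch of the same calculation over `(timestamp, lag)` samples (endpoint slope rather than the least-squares fit `deriv()` actually uses):

```python
def lag_rate(samples):
    """Lag growth rate in messages/sec from (timestamp_sec, lag) samples."""
    (t0, l0), (t1, l1) = samples[0], samples[-1]
    return (l1 - l0) / (t1 - t0)

# Lag grew from 100 to 700 over 15 minutes: 600 msgs / 900 s
samples = [(0, 100), (300, 300), (600, 500), (900, 700)]
rate = lag_rate(samples)   # ~0.67 msg/s, i.e. ~40 msg/min
```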

Common Errors Table

| Error | Symptoms | Consequences | Solution |
| ----- | -------- | ------------ | -------- |
| Monitoring only total lag | Problematic partitions not visible | One partition lags while others are OK; the problem stays hidden | Lag per partition |
| Lag without context | Lag = 500: is this good or bad? | Wrong decisions | Lag + throughput + trend |
| No alerting | Lag is growing and nobody sees it | Consumer can't keep up, data is delayed | Alert thresholds |
| Assuming lag = 0 means OK | Lag = 0 but the consumer is stopped | False sense of security | Monitor consumer status |
| Only current lag | No history | Trend analysis is impossible | Store lag history |
| Monitoring on the producer side | Lag is a consumer metric, not a producer one | Irrelevant data | Monitor consumer lag |

Usage Scenarios

Scenario 1: Lag grows linearly

Time     Lag
10:00    100
10:05    300
10:10    500
10:15    700
→ Linear growth = consumer slower than producer

Root cause:
  - Consumer processing slow (DB bottleneck, external API)
  - GC pauses
  - Network latency to broker

Action: Scale consumers or optimize processing

Scenario 2: Lag spike then recovery

Time     Lag
10:00    100
10:05    5000   ← spike
10:10    3000
10:15    500    ← recovery
10:20    100

Root cause:
  - Temporary broker unavailability
  - Network partition resolved
  - GC pause ended

Action: Investigate spike cause, may not need scaling

Scenario 3: Lag grows exponentially

Time     Lag
10:00    100
10:05    200
10:10    800
10:15    3200
10:20    12800
→ Exponential = consumer completely stuck

Root cause:
  - Poison pill message
  - Deadlock in consumer
  - External dependency down

Action: URGENT — find root cause, may need to skip messages
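The three shapes above can be distinguished mechanically from the deltas between samples. A rough, illustrative classifier (the 2x and 5x thresholds are arbitrary choices for this sketch, not standard values):

```python
def classify_lag_trend(lags):
    """Rough classification of equally spaced lag samples."""
    deltas = [b - a for a, b in zip(lags, lags[1:])]
    if all(d <= 0 for d in deltas):
        return "recovering"
    # Each increment at least doubles: growth is accelerating, consumer likely stuck
    if len(deltas) >= 3 and all(b >= 2 * a for a, b in zip(deltas, deltas[1:])):
        return "exponential"
    # Peak far above the final value: a spike that has already recovered
    if max(lags) >= 5 * lags[-1]:
        return "spike-recovered"
    return "linear"

assert classify_lag_trend([100, 300, 500, 700]) == "linear"                   # Scenario 1
assert classify_lag_trend([100, 5000, 3000, 500, 100]) == "spike-recovered"   # Scenario 2
assert classify_lag_trend([100, 200, 800, 3200, 12800]) == "exponential"      # Scenario 3
```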

When NOT to Use Lag as the Only Metric

  • Lag = 0 but consumer is processing slowly — may spike on burst
  • Lag is high but decreasing — consumer is catching up, don’t panic
  • Lag without producer rate — lag may be high with low producer activity
  • Lag without error rate — lag may grow due to errors, not slow processing

Senior Level

Deep Internals

Lag Calculation Internals

Kafka broker:
  Partition State:
    - LogEndOffset: offset the next message will get
    - HighWatermark: offset of the last committed message (all ISR)
    - LastStableOffset: for transactional messages

Consumer:
  CommittedOffset: stored in __consumer_offsets topic
  Position: next offset to be read (may be ahead of committed)

Lag (per partition) = LogEndOffset - CommittedOffset

__consumer_offsets topic:
  - Internal Kafka topic
  - Compacted cleanup (only latest offset per group+partition)
  - 50 partitions (default)
  - Key: [group.id, topic, partition]
  - Value: {offset, metadata, leaderEpoch, timestamp}

Offset Commit Lag vs Processing Lag

Offset Commit Lag:
  CommittedOffset = 800
  ProcessingOffset = 850  (processed but not committed)
  LogEndOffset = 1000

  "Visible lag" = 1000 - 800 = 200
  "Real lag" = 1000 - 850 = 150

  Difference = 50 messages in-flight (processed, not committed)

This matters for monitoring:
  - Burrow sees committed offset → lag = 200
  - Real processing lag = 150
  - In-flight messages are not visible to external tools

Kafka Exporter Internals

kafka_exporter (Prometheus):
  1. AdminClient.ListConsumerGroupOffsets() → committed offsets
  2. AdminClient.ListOffsets(latest) → log end offsets
  3. Calculates lag
  4. Exposes as Prometheus metrics

Metrics exposed:
  kafka_consumergroup_lag{group="order-service",topic="orders",partition="0"}
  kafka_consumergroup_lag_sum{group="order-service",topic="orders"}
  kafka_topic_partition_current_offset{topic="orders",partition="0"}
  kafka_consumergroup_current_offset{group="order-service",topic="orders",partition="0"}

Scrape interval: 10–30s (not more often — expensive admin calls)
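The exporter's cycle can be sketched with a stubbed admin client standing in for the real AdminClient calls (`StubAdmin` and its method names are illustrative, preloaded with the offsets from the earlier CLI example, not a real client API):

```python
class StubAdmin:
    """Stand-in for the Kafka AdminClient, preloaded with example offsets."""
    def list_consumer_group_offsets(self, group):
        return {("orders", 0): 800, ("orders", 1): 750, ("orders", 2): 900}
    def list_latest_offsets(self, topic_partitions):
        latest = {("orders", 0): 1000, ("orders", 1): 900, ("orders", 2): 950}
        return {tp: latest[tp] for tp in topic_partitions}

def scrape_lag(admin, group):
    """One scrape cycle: committed offsets -> latest offsets -> lag per partition."""
    committed = admin.list_consumer_group_offsets(group)
    latest = admin.list_latest_offsets(list(committed))
    return {tp: latest[tp] - committed[tp] for tp in committed}

metrics = scrape_lag(StubAdmin(), "order-service")
# {("orders", 0): 200, ("orders", 1): 150, ("orders", 2): 50}
```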

Trade-offs

| Approach | Accuracy | Overhead | Real-time | Complexity |
| -------- | -------- | -------- | --------- | ---------- |
| CLI (kafka-consumer-groups) | High | Manual | Point-in-time | Low |
| Burrow | High | Low | ~5s delay | Medium |
| kafka_exporter + Prometheus | Medium | Low | ~10s delay | Medium |
| JMX (kafka.consumer) | High | Medium | Real-time | High |
| Custom monitoring | Maximum | Varies | Real-time | High |

Edge Cases

1. Lag = 0 but consumer is not working:

Scenario:
  Consumer processed up to offset 1000
  No new messages produced for the last 10 min
  Lag = LogEndOffset(1000) - CommittedOffset(1000) = 0

  Looks OK, but the consumer may be dead!

Resolution:
  1. Monitor consumer heartbeat (session.timeout.ms)
  2. Monitor processing rate (messages/sec)
  3. Monitor "time since last commit"
  4. Burrow status: STOP if no heartbeat
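Points 1–3 combine into a simple health check: treat a consumer as unhealthy when it has not committed recently, regardless of lag. A sketch (the status names and the 120 s default are illustrative, loosely modeled on Burrow's STOP):

```python
import time

def consumer_health(lag, last_commit_ts, now=None, max_commit_age=120.0):
    """Lag alone is not enough: also check time since the last commit."""
    now = time.time() if now is None else now
    if now - last_commit_ts > max_commit_age:
        return "STOP"   # not committing: possibly dead, even if lag == 0
    return "OK" if lag == 0 else "LAGGING"

# lag == 0 but no commit for 10 minutes -> STOP, not OK
assert consumer_health(0, last_commit_ts=0, now=600) == "STOP"
assert consumer_health(0, last_commit_ts=590, now=600) == "OK"
```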

2. Lag spike during rebalancing:

Scenario:
  10:00: Consumer-1 processes partition 0, lag = 100
  10:01: Consumer-2 joins → rebalancing
  10:02: Partition 0 assigned to Consumer-2
  10:02: Lag spike: 100 → 500 (no processing during rebalance)
  10:05: Consumer-2 catches up, lag = 50

This is expected behavior!
  - Lag spike during rebalance = normal
  - Lag recovery after rebalance = healthy

Monitoring:
  Ignore lag spikes < 5 min duration
  Alert on lag spike > 10 min without recovery
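The "ignore short spikes" rule is the same idea as the `for:` clause in Prometheus alerts: require the condition to hold for a sustained window before firing. A sketch over equally spaced samples:

```python
def should_alert(lag_samples, threshold, min_duration, interval):
    """Alert only if lag stays above threshold for min_duration seconds
    (samples are `interval` seconds apart); short rebalance spikes pass."""
    needed = max(1, int(min_duration / interval))
    recent = lag_samples[-needed:]
    return len(recent) >= needed and all(l > threshold for l in recent)

# One 5-minute spike during a rebalance: suppressed
assert not should_alert([100, 500, 100], threshold=300, min_duration=600, interval=300)
# Sustained high lag: alert
assert should_alert([400, 500, 600], threshold=300, min_duration=600, interval=300)
```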

3. Consumer rewind (lag sudden increase):

Scenario:
  Committed offset: 1000
  Next poll: committed offset = 500 (rewind!)
  Lag suddenly: 500 → 1500

Root causes:
  1. Offset reset (auto.offset.reset = earliest)
  2. Manual seek (consumer.seek)
  3. Offset data loss (__consumer_offsets corruption)
  4. Consumer group reset

Burrow status: REWIND
Action:
  Investigate — rewind may be intentional (replay)
  Or data loss (critical!)
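Rewind detection itself is just a comparison of consecutive committed-offset snapshots. A minimal sketch:

```python
def detect_rewind(prev_committed, curr_committed):
    """Partitions whose committed offset moved backwards (Burrow's REWIND)."""
    return [p for p, off in curr_committed.items()
            if p in prev_committed and off < prev_committed[p]]

assert detect_rewind({0: 1000}, {0: 500}) == [0]   # rewind: 1000 -> 500
assert detect_rewind({0: 1000}, {0: 1200}) == []   # normal forward progress
```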

4. Transactional Messages and Lag:

Scenario:
  Producer started a transaction
  Wrote messages at offsets 1000-1050
  Transaction NOT committed yet
  LogEndOffset = 1051
  LastStableOffset = 1000

Consumer lag calculation:
  If consumer isolation.level = read_committed:
    Sees only committed messages up to offset 1000
    Lag = 1000 - committed_offset  (correct)
  If consumer isolation.level = read_uncommitted:
    Sees all messages including uncommitted
    Lag = 1051 - committed_offset  (misleading!)

Resolution: monitoring should use LastStableOffset for read_committed consumers
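A lag function that honors the isolation level, using the offsets from this scenario (the committed offset of 900 is an assumed example value):

```python
def consumer_lag_tx(log_end, last_stable, committed, isolation="read_committed"):
    """read_committed consumers can only advance to the LastStableOffset,
    so their lag should be measured against it, not LogEndOffset."""
    end = last_stable if isolation == "read_committed" else log_end
    return end - committed

assert consumer_lag_tx(1051, 1000, 900) == 100                                 # correct
assert consumer_lag_tx(1051, 1000, 900, isolation="read_uncommitted") == 151   # misleading
```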

5. Multi-cluster Lag:

Scenario:
  MirrorMaker 2 replicates: Cluster A → Cluster B
  Consumer reads from Cluster B

  Lag on Cluster B may grow due to:
  1. Consumer slow (normal lag)
  2. Replication lag (Cluster A → Cluster B slow)
  3. Producer slow on Cluster A

Need to monitor:
  - Consumer lag on Cluster B
  - Replication lag (A → B)
  - Producer lag on Cluster A

Only then can you determine the root cause

Performance (Production Numbers)

| Metric | Normal | Warning | Critical |
| ------ | ------ | ------- | -------- |
| Lag per partition | < 1000 | 1000–10000 | > 10000 |
| Lag growth rate | < 100/min | 100–1000/min | > 1000/min |
| Consumer throughput | > 90% of producer | 50–90% | < 50% |
| Commit latency (P99) | < 50 ms | 50–200 ms | > 200 ms |
| Time to recover lag | < 5 min | 5–30 min | > 30 min |

Production War Story

Situation: Delivery platform, 500K events/min, 20 partitions, 10 consumers.

Problem: At 2 AM, lag started growing: 1000 → 50000 in 30 minutes. By 3 AM — 200000 lag. On-call engineer received an alert and started investigation. By 4 AM — 500000 lag, 1 hour backlog.

Investigation timeline:

02:00 — Alert: lag > 10000
02:05 — On-call checks Grafana: lag growing exponentially
02:10 — Consumer throughput dropped: 50K → 5K msg/min
02:15 — No consumer restarts, no rebalancing
02:20 — DB connection pool exhausted (root cause!)
02:25 — DB team: migration running, connection limit reached
02:30 — Migration stopped, connections freed
02:35 — Consumers catch up
03:30 — Lag back to 0

Root cause: DB team ran a migration without notification. The migration used all connections. Consumers couldn’t write to the DB → processing stopped → lag grew.

Lessons:

  1. Lag alert fired — GOOD
  2. But lag didn’t say WHY — DB connection monitoring was needed
  3. Cross-team communication process needed
  4. Lag growth rate alert (> 10K/min) — more actionable
  5. Runbook for lag investigation — critical

Post-mortem improvements:

Alerts added:
  - lag > 10K for 5m → warning
  - lag > 100K for 2m → critical
  - lag growth rate > 10K/min → critical
  - consumer throughput < 50% of producer → warning
  - DB connection pool > 90% → critical

Runbook created:
  1. Check lag per partition (is it all or one?)
  2. Check consumer throughput vs producer rate
  3. Check consumer logs for errors
  4. Check external dependencies (DB, API)
  5. Check recent deployments

Monitoring (JMX, Prometheus, Burrow)

JMX Metrics

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
  - records-lag-max: max lag across all partitions
  - records-lag-avg: average lag
  - fetch-latency-avg: time to fetch from broker
  - fetch-rate: fetch requests per second

kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
  - commit-latency-avg
  - commit-rate
  - last-heartbeat-seconds-ago

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  (if partitions are under-replicated → consumer may see stale data)

Prometheus + Grafana Dashboard

# Recording rules
- record: kafka_consumer_lag_rate
  expr: deriv(kafka_consumergroup_lag[5m])

- record: kafka_consumer_throughput
  expr: rate(kafka_consumergroup_current_offset[5m])

- record: kafka_producer_throughput
  expr: rate(kafka_topic_partition_current_offset[5m])

# Alerts
- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag_sum > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag is high (> 10K)"

- alert: KafkaConsumerLagGrowing
  expr: kafka_consumer_lag_rate > 1000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Consumer lag is growing at {{ $value }}/min"

- alert: KafkaConsumerThroughputLow
  expr: kafka_consumer_throughput / kafka_producer_throughput < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Consumer throughput < 50% of producer throughput"

Burrow Configuration

# burrow.ini
[consumer.order-service]
class-name=kafka
cluster=production
group=order-service

# Burrow evaluation
[lagcheck.order-service]
group-order-service.orders.0=OK
group-order-service.orders.1=OK
group-order-service.orders.2=OK

# Notification
[notifier.default]
class-name=email
threshold=1
send-close=true
template=lag-email.tmpl

# HTTP endpoint
[httpserver.default]
address=:8000

Burrow evaluation statuses:

OK:      Lag decreasing or 0
WARN:    Lag stable but above threshold
ERROR:   Lag increasing
STOP:    Consumer stopped committing
STALL:   Lag not changing (consumer stuck)
REWIND:  Offset moved backwards

Highload Best Practices

✅ Lag per partition monitoring (not only total)
✅ Lag growth rate monitoring (deriv/lag rate)
✅ Lag + throughput + producer rate together
✅ Burrow status monitoring (OK/WARN/ERROR/STOP/STALL)
✅ Alert thresholds by severity (warning, critical)
✅ Runbook for lag investigation
✅ Auto-scaling on lag thresholds (KEDA, custom HPA)
✅ Historical lag data (trend analysis)
✅ Multi-cluster lag (replication lag)
✅ Transactional message awareness (LastStableOffset)

❌ Lag without context (throughput, producer rate)
❌ Lag without alerting (useless metric)
❌ Total lag without per-partition breakdown
❌ Lag monitoring without a runbook
❌ Lag without historical data (no trend analysis)
❌ Lag on production without auto-scaling plan
❌ Lag monitoring without consumer status (dead consumer, lag=0)

Auto-scaling Based on Lag (KEDA)

# KEDA ScaledObject for auto-scaling Kafka consumers
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: order-consumer-deployment
  minReplicaCount: 3
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: order-service
        topic: orders
        lagThreshold: "5000"    # scale up when lag > 5000
        activationLagThreshold: "1000"  # activate when lag > 1000
        offsetResetPolicy: "latest"
        allowIdleConsumers: "false"

Architectural Decisions

  1. Lag is a leading indicator — grows before users are impacted
  2. Lag + throughput + error rate — three pillars of consumer health
  3. Burrow status > lag number — status trend is more actionable than absolute lag
  4. Auto-scaling on lag — proactive scaling, not reactive
  5. Lag per partition — partition-level visibility is critical

Summary for Senior

  • Lag = LogEndOffset - CommittedOffset, with nuances (in-flight, transactional)
  • Burrow status (OK/WARN/ERROR/STOP/STALL/REWIND) is more informative than lag value
  • Lag growth rate is more actionable than absolute lag
  • Lag spike during rebalancing is normal, don’t alert
  • Lag = 0 doesn’t mean consumer is healthy (check heartbeat, throughput)
  • Transactional messages — use LastStableOffset for lag calculation
  • Auto-scaling on lag thresholds — KEDA for Kubernetes
  • Production runbook for lag investigation — mandatory
  • Lag per partition — total lag hides partition-level issues
  • Multi-cluster: consumer lag + replication lag + producer lag — all three are needed

🎯 Interview Cheat Sheet

Must know:

  • Consumer Lag = LogEndOffset - CommittedOffset — key consumer health metric
  • Lag per partition is critical: total lag hides problematic partitions (hot key)
  • Burrow status: OK, WARN, ERROR, STOP, STALL, REWIND — more informative than absolute lag
  • Lag growth rate (deriv) — more actionable than absolute lag value
  • Lag = 0 doesn’t mean healthy: consumer may be stopped, no new messages
  • Tools: kafka-consumer-groups.sh, Burrow, Prometheus + kafka_exporter, Grafana
  • Auto-scaling on lag thresholds: KEDA for Kubernetes

Common follow-up questions:

  • Lag=0 — is this always OK? — No, consumer may be dead; check heartbeat and throughput.
  • What does REWIND mean in Burrow? — Offset moved backwards (replay, reset, or data loss).
  • Lag spike during rebalancing — is this a problem? — No, expected behavior; ignore < 5 min duration.
  • How to identify a hot partition? — Lag imbalance across partitions: one lags, others are OK.

Red flags (DO NOT say):

  • “Lag=0 means everything is fine” — consumer may be stopped
  • “Only total lag is enough” — hides partition-level problems
  • “Lag spike during rebalance is a critical problem” — normal behavior
  • “Burrow is not needed, CLI is enough” — CLI = point-in-time, Burrow = trend analysis

Related topics:

  • [[12. What is offset in Kafka]]
  • [[13. How does offset commit work]]
  • [[24. How to handle errors when reading messages]]
  • [[25. What is DLQ (Dead Letter Queue)]]