How to monitor consumer lag in Kafka
Junior Level
Simple Definition
Consumer lag is the difference between the latest message in a Kafka partition and the last message the consumer processed. It shows how far behind the consumer is.
Analogy
Imagine a checkout line at a store. The cashier (the consumer) has served 800 customers, and 200 are still waiting in line. Lag = 200: the number of customers still waiting. If the lag keeps growing, the cashier can't keep up and it's time to open another register.
Visualization
Partition 0:
Log End Offset (latest in Kafka): 1000
Committed Offset (processed): 800
────────────────────────────────────────
Consumer Lag: 200
[0 .......................... 800 |← 200 →| 1000]
 |-------- processed --------|    lag    |
How to Check Lag
# Kafka CLI utility
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group order-service
GROUP          TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service  orders  0          800             1000            200
order-service  orders  1          750             900             150
order-service  orders  2          900             950             50
Total lag: 400 messages
What Lag Means
| Lag | What It Means | What to Do |
|---|---|---|
| 0 | Consumer processed everything | Nothing, all good |
| 1–100 | Normal backlog | Monitor |
| 100–1000 | Consumer is slowing down | Check throughput |
| 1000+ | Consumer can’t keep up | Scale up |
| Growing fast | Critical problem | Alert + investigate |
(These values depend on throughput: lag must always be evaluated together with the producer rate. A lag of 1000 at 100 msg/s means 10 seconds behind; at 1M msg/s it is negligible.)
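The note above can be turned into a quick rule of thumb. A minimal sketch (the helper name is hypothetical) that converts absolute lag into "seconds behind" given a steady producer rate:

```python
def lag_seconds(lag: int, producer_rate_per_sec: float) -> float:
    """Seconds of data the consumer is behind, assuming a steady producer rate."""
    if producer_rate_per_sec <= 0:
        # No production: any backlog will never be expressed in time units
        return float("inf") if lag > 0 else 0.0
    return lag / producer_rate_per_sec

print(lag_seconds(1000, 100))        # → 10.0 (10 seconds behind)
print(lag_seconds(1000, 1_000_000))  # → 0.001 (negligible)
```

This is why a raw lag threshold alone makes a poor alert: the same number means very different delays at different throughputs.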
When This Matters
- Production monitoring — always monitor lag
- Scaling decisions — when to add consumers
- Health checks — lag is a health indicator
- SLA monitoring — guarantee processing time
Middle Level
How Lag Calculation Works
Lag = LogEndOffset - CommittedOffset
LogEndOffset:
- The last offset written to the partition
- Updated on every producer send
- Stored on the broker
CommittedOffset:
- The last offset committed by the consumer
- Stored in __consumer_offsets topic
- Updated on consumer.commitSync/Async
Lag calculation:
- Computed by external tools (not Kafka itself)
- Kafka broker provides LogEndOffset
- Monitoring tool reads committed offsets
- Lag = the difference
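The steps above are exactly what external tools do. A pure-Python sketch of the per-partition computation, with offsets hard-coded to match the CLI output earlier in this note:

```python
def compute_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition = LogEndOffset - CommittedOffset.

    Partitions with no committed offset are treated as fully behind (offset 0),
    a simplification for illustration.
    """
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

log_end   = {0: 1000, 1: 900, 2: 950}   # from the broker
committed = {0: 800,  1: 750, 2: 900}   # from __consumer_offsets

lag = compute_lag(log_end, committed)
print(lag)                  # → {0: 200, 1: 150, 2: 50}
print(sum(lag.values()))    # → 400 (total lag)
```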
Monitoring Tools
1. Kafka CLI
# All consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Details of a specific group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group order-service
# Reset offsets (careful!)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group order-service --reset-offsets --to-latest --execute
2. Burrow (LinkedIn)
# Burrow HTTP API
GET http://burrow-server:8000/v3/kafka/cluster/consumer/order-service/status
Response:
{
"error": false,
"message": "consumer status returned",
"status": {
"cluster": "production",
"group": "order-service",
"status": "OK",
"complete": 1.0,
"partitions": [
{
"topic": "orders",
"partition": 0,
"status": "OK",
"start": {"offset": 700, "timestamp": 1712000000000},
"end": {"offset": 1000, "timestamp": 1712000100000},
"current_lag": 200,
"max_lag": 300
}
],
"partition_count": 3
}
}
Burrow status values:

| Status | Meaning |
|---|---|
| OK | Lag is decreasing or 0 |
| WARN | Lag is stable but high |
| ERROR | Lag is growing |
| STOP | Consumer is not committing |
| STALL | Lag is not changing |
| REWIND | Offset moved backwards |
3. Prometheus + kafka_exporter
# docker-compose.yml
services:
kafka-exporter:
image: danielqsj/kafka-exporter
command:
- '--kafka.server=kafka:9092'
ports:
- "9308:9308"
prometheus:
image: prom/prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
image: grafana/grafana
ports:
- "3000:3000"
# prometheus.yml
scrape_configs:
- job_name: 'kafka'
static_configs:
- targets: ['kafka-exporter:9308']
4. Grafana Dashboard
Key panels:
- Consumer lag per partition (time series)
- Total lag across partitions (single stat)
- Lag rate (derivative — lag growing speed)
- Consumer throughput (messages/sec)
- Producer vs Consumer throughput comparison
Common Errors Table
| Error | Symptoms | Consequences | Solution |
|---|---|---|---|
| Monitoring only total lag | Problematic partitions not visible | One partition lags, others OK — problem is hidden | Lag per partition |
| Lag without context | Lag = 500 — is this good or bad? | Wrong decisions | Lag + throughput + trend |
| Without alerting | Lag is growing, nobody sees | Consumer can’t keep up, data delay | Alert thresholds |
| Lag = 0 = OK assumption | Lag = 0 but consumer is stopped | False sense of security | Monitor consumer status |
| Only current lag | No history | Trend analysis is impossible | Store lag history |
| Monitoring on producer side | Lag is a consumer metric, not producer | Irrelevant data | Monitor consumer lag |
Usage Scenarios
Scenario 1: Lag grows linearly
Time Lag
10:00 100
10:05 300
10:10 500
10:15 700
→ Linear growth = consumer slower than producer
Root cause:
- Consumer processing slow (DB bottleneck, external API)
- GC pauses
- Network latency to broker
Action: Scale consumers or optimize processing
Scenario 2: Lag spike then recovery
Time Lag
10:00 100
10:05 5000 ← spike
10:10 3000
10:15 500 ← recovery
10:20 100
Root cause:
- Temporary broker unavailability
- Network partition resolved
- GC pause ended
Action: Investigate spike cause, may not need scaling
Scenario 3: Lag grows exponentially
Time Lag
10:00 100
10:05 200
10:10 800
10:15 3200
10:20 12800
→ Exponential = consumer completely stuck
Root cause:
- Poison pill message
- Deadlock in consumer
- External dependency down
Action: URGENT — find root cause, may need to skip messages
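The three shapes above can be told apart mechanically. A rough classifier over equally spaced lag samples (the logic is illustrative, not a production detector):

```python
def classify_lag_trend(samples: list) -> str:
    """samples: lag values at equal intervals, oldest first."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if all(d <= 0 for d in deltas):
        return "recovering"          # flat or shrinking backlog
    if all(d > 0 for d in deltas):
        # If the growth itself keeps growing, the consumer is likely stuck
        accel = [b - a for a, b in zip(deltas, deltas[1:])]
        if accel and all(a > 0 for a in accel):
            return "exponential"
        return "linear"
    return "spike"                   # mixed growth and recovery

print(classify_lag_trend([100, 300, 500, 700]))           # → linear
print(classify_lag_trend([100, 200, 800, 3200, 12800]))   # → exponential
print(classify_lag_trend([100, 5000, 3000, 500, 100]))    # → spike
```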
When NOT to Use Lag as the Only Metric
- Lag = 0 but consumer is processing slowly — may spike on burst
- Lag is high but decreasing — consumer is catching up, don’t panic
- Lag without producer rate — lag may be high with low producer activity
- Lag without error rate — lag may grow due to errors, not slow processing
Senior Level
Deep Internals
Lag Calculation Internals
Kafka broker:
Partition State:
- LogEndOffset: offset the next message will get
- HighWatermark: offset of the last committed message (all ISR)
- LastStableOffset: for transactional messages
Consumer:
CommittedOffset: stored in __consumer_offsets topic
Position: next offset to be read (may be ahead of committed)
Lag (per partition) = LogEndOffset - CommittedOffset
__consumer_offsets topic:
- Internal Kafka topic
- Compacted cleanup (only latest offset per group+partition)
- 50 partitions (default)
- Key: [group.id, topic, partition]
- Value: {offset, metadata, leaderEpoch, timestamp}
Offset Commit Lag vs Processing Lag
Offset Commit Lag:
CommittedOffset = 800
ProcessingOffset = 850 (processed but not committed)
LogEndOffset = 1000
"Visible lag" = 1000 - 800 = 200
"Real lag" = 1000 - 850 = 150
Difference = 50 messages in-flight (processed, not committed)
This matters for monitoring:
- Burrow sees committed offset → lag = 200
- Real processing lag = 150
- In-flight messages are not visible to external tools
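The arithmetic above, spelled out with the same numbers:

```python
log_end_offset   = 1000
committed_offset = 800   # last committed offset (all external tools see this)
position         = 850   # already processed, not yet committed

visible_lag = log_end_offset - committed_offset   # what Burrow/exporters report
real_lag    = log_end_offset - position           # actual processing lag
in_flight   = position - committed_offset         # processed-but-uncommitted gap

print(visible_lag, real_lag, in_flight)  # → 200 150 50
```

The gap shrinks with more frequent commits, at the cost of higher commit overhead.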
Kafka Exporter Internals
kafka_exporter (Prometheus):
1. AdminClient.ListConsumerGroupOffsets() → committed offsets
2. AdminClient.ListOffsets(latest) → log end offsets
3. Calculates lag
4. Exposes as Prometheus metrics
Metrics exposed:
kafka_consumergroup_lag{group="order-service",topic="orders",partition="0"}
kafka_consumergroup_lag_sum{group="order-service",topic="orders"}
kafka_topic_partition_current_offset{topic="orders",partition="0"}
kafka_consumergroup_current_offset{group="order-service",topic="orders",partition="0"}
Scrape interval: 10–30s (not more often — expensive admin calls)
Trade-offs
| Approach | Accuracy | Overhead | Real-time | Complexity |
|---|---|---|---|---|
| CLI (kafka-consumer-groups) | High | Manual | Point-in-time | Low |
| Burrow | High | Low | ~5s delay | Medium |
| kafka_exporter + Prometheus | Medium | Low | ~10s delay | Medium |
| JMX (kafka.consumer) | High | Medium | Real-time | High |
| Custom monitoring | Maximum | Varies | Real-time | High |
Edge Cases
1. Lag = 0 but consumer is not working:
Scenario:
Consumer processed up to offset 1000
No new messages produced for the last 10 min
Lag = LogEndOffset(1000) - CommittedOffset(1000) = 0
Looks OK, but the consumer may be dead!
Resolution:
1. Monitor consumer heartbeat (session.timeout.ms)
2. Monitor processing rate (messages/sec)
3. Monitor "time since last commit"
4. Burrow status: STOP if no heartbeat
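Point 3 above ("time since last commit") is easy to sketch. A hypothetical health check that catches the dead-consumer-with-zero-lag case:

```python
def consumer_status(lag: int, seconds_since_commit: float,
                    max_commit_age_s: float = 300) -> str:
    """Combine lag with commit freshness; thresholds are illustrative."""
    if seconds_since_commit > max_commit_age_s:
        return "STOPPED?"   # lag may even be 0, but nothing is being committed
    if lag == 0:
        return "OK"
    return "LAGGING"

print(consumer_status(0, 60))    # → OK
print(consumer_status(0, 900))   # → STOPPED? (the false-security case)
print(consumer_status(500, 30))  # → LAGGING
```

Note the caveat: a healthy consumer on a quiet topic also stops committing, so this check should be combined with the producer rate.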
2. Lag spike during rebalancing:
Scenario:
10:00: Consumer-1 processes partition 0, lag = 100
10:01: Consumer-2 joins → rebalancing
10:02: Partition 0 assigned to Consumer-2
10:02: Lag spike: 100 → 500 (no processing during rebalance)
10:05: Consumer-2 catches up, lag = 50
This is expected behavior!
- Lag spike during rebalance = normal
- Lag recovery after rebalance = healthy
Monitoring:
Ignore lag spikes < 5 min duration
Alert on lag spike > 10 min without recovery
3. Consumer rewind (lag sudden increase):
Scenario:
Committed offset: 1000
Next poll: committed offset = 500 (rewind!)
Lag suddenly: 500 → 1500
Root causes:
1. Offset reset (auto.offset.reset = earliest)
2. Manual seek (consumer.seek)
3. Offset data loss (__consumer_offsets corruption)
4. Consumer group reset
Burrow status: REWIND
Action:
Investigate — rewind may be intentional (replay)
Or data loss (critical!)
4. Transactional Messages and Lag:
Scenario:
Producer started a transaction
Wrote messages at offsets 1000-1050
Transaction NOT committed yet
LogEndOffset = 1051
LastStableOffset = 1000
Consumer lag calculation:
If consumer isolation.level = read_committed:
Sees only committed messages up to offset 1000
Lag = 1000 - committed_offset (correct)
If consumer isolation.level = read_uncommitted:
Sees all messages including uncommitted
Lag = 1051 - committed_offset (misleading!)
Resolution: monitoring should use LastStableOffset
for read_committed consumers
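The resolution above amounts to choosing the right "end" of the log per isolation level. A sketch (function name is illustrative):

```python
def consumer_lag(committed_offset: int, log_end_offset: int,
                 last_stable_offset: int,
                 isolation_level: str = "read_committed") -> int:
    """Lag against LSO for read_committed, against LEO otherwise."""
    end = (last_stable_offset if isolation_level == "read_committed"
           else log_end_offset)
    return end - committed_offset

print(consumer_lag(900, 1051, 1000, "read_committed"))    # → 100 (correct)
print(consumer_lag(900, 1051, 1000, "read_uncommitted"))  # → 151 (counts uncommitted)
```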
5. Multi-cluster Lag:
Scenario:
MirrorMaker 2 replicates: Cluster A → Cluster B
Consumer reads from Cluster B
Lag on Cluster B may grow due to:
1. Consumer slow (normal lag)
2. Replication lag (Cluster A → Cluster B slow)
3. Producer slow on Cluster A
Need to monitor:
- Consumer lag on Cluster B
- Replication lag (A → B)
- Producer lag on Cluster A
Only then can you determine the root cause
Performance (Production Numbers)
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Lag per partition | < 1000 | 1000–10000 | > 10000 |
| Lag growth rate | < 100/min | 100–1000/min | > 1000/min |
| Consumer throughput | > 90% of producer | 50–90% | < 50% |
| Commit latency (P99) | < 50ms | 50–200ms | > 200ms |
| Time to recover lag | < 5 min | 5–30 min | > 30 min |
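One number worth deriving from this table is "time to recover lag". Assuming steady rates, the backlog shrinks by (consumer rate minus producer rate) messages per second (hypothetical helper):

```python
def recovery_time_s(lag: int, consumer_rate: float, producer_rate: float) -> float:
    """Estimated seconds until the consumer catches up, at steady rates."""
    surplus = consumer_rate - producer_rate   # msgs/sec the backlog shrinks by
    if surplus <= 0:
        return float("inf")                   # consumer never catches up
    return lag / surplus

print(recovery_time_s(50_000, 1_200, 1_000))  # → 250.0 seconds (~4 min)
print(recovery_time_s(100, 50, 100))          # → inf (must scale or optimize)
```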
Production War Story
Situation: Delivery platform, 500K events/min, 20 partitions, 10 consumers.
Problem: At 2 AM, lag started growing: 1000 → 50000 in 30 minutes, and kept climbing past 200000 — roughly an hour of backlog. The on-call engineer received an alert and started investigating.
Investigation timeline:
02:00 — Alert: lag > 10000
02:05 — On-call checks Grafana: lag growing exponentially
02:10 — Consumer throughput dropped: 50K → 5K msg/min
02:15 — No consumer restarts, no rebalancing
02:20 — DB connection pool exhausted (root cause!)
02:25 — DB team: migration running, connection limit reached
02:30 — Migration stopped, connections freed
02:35 — Consumers catch up
03:30 — Lag back to 0
Root cause: DB team ran a migration without notification. The migration used all connections. Consumers couldn’t write to the DB → processing stopped → lag grew.
Lessons:
- Lag alert fired — GOOD
- But lag didn’t say WHY — DB connection monitoring was needed
- Cross-team communication process needed
- Lag growth rate alert (> 10K/min) — more actionable
- Runbook for lag investigation — critical
Post-mortem improvements:
Alerts added:
- lag > 10K for 5m → warning
- lag > 100K for 2m → critical
- lag growth rate > 10K/min → critical
- consumer throughput < 50% of producer → warning
- DB connection pool > 90% → critical
Runbook created:
1. Check lag per partition (is it all or one?)
2. Check consumer throughput vs producer rate
3. Check consumer logs for errors
4. Check external dependencies (DB, API)
5. Check recent deployments
Monitoring (JMX, Prometheus, Burrow)
JMX Metrics
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
- records-lag-max: max lag across all partitions
- records-lag-avg: average lag
- fetch-latency-avg: time to fetch from broker
- fetch-rate: fetch requests per second
kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
- commit-latency-avg
- commit-rate
- last-heartbeat-seconds-ago
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
(if partitions are under-replicated → consumer may see stale data)
Prometheus + Grafana Dashboard
# Recording rules
- record: kafka_consumer_lag_rate
expr: deriv(kafka_consumergroup_lag[5m])
- record: kafka_consumer_throughput
expr: rate(kafka_consumergroup_current_offset[5m])
- record: kafka_producer_throughput
expr: rate(kafka_topic_partition_current_offset[5m])
# Alerts
- alert: KafkaConsumerLagHigh
expr: kafka_consumergroup_lag_sum > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "Consumer lag is high (> 10K)"
- alert: KafkaConsumerLagGrowing
expr: kafka_consumer_lag_rate > 1000
for: 5m
labels:
severity: critical
annotations:
summary: "Consumer lag growing at {{ $value }}/min"
- alert: KafkaConsumerThroughputLow
expr: kafka_consumer_throughput / kafka_producer_throughput < 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "Consumer throughput < 50% of producer throughput"
Burrow Configuration
# burrow.ini
[consumer.order-service]
class-name=kafka
cluster=production
group=order-service
# Burrow evaluation
[lagcheck.order-service]
group-order-service.orders.0=OK
group-order-service.orders.1=OK
group-order-service.orders.2=OK
# Notification
[notifier.default]
class-name=email
threshold=1
send-close=true
template=lag-email.tmpl
# HTTP endpoint
[httpserver.default]
address=:8000
Burrow evaluation statuses:
OK: Lag decreasing or 0
WARN: Lag stable but above threshold
ERROR: Lag increasing
STOP: Consumer stopped committing
STALL: Lag not changing (consumer stuck)
REWIND: Offset moved backwards
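The status logic can be approximated in a few lines. A toy version of Burrow's evaluation (heavily simplified: real Burrow evaluates sliding windows of offset commits per partition):

```python
def burrow_status(lag_window: list, offset_window: list,
                  committing: bool = True) -> str:
    """lag_window / offset_window: oldest-first samples for one partition."""
    if not committing:
        return "STOP"                                        # stopped committing
    if offset_window[-1] < offset_window[0]:
        return "REWIND"                                      # offset moved backwards
    if offset_window[-1] == offset_window[0] and lag_window[-1] > 0:
        return "STALL"                                       # committing, not advancing
    if lag_window[-1] > lag_window[0]:
        return "ERROR"                                       # lag growing
    if lag_window[-1] == lag_window[0] and lag_window[-1] > 0:
        return "WARN"                                        # lag stable but nonzero
    return "OK"                                              # decreasing or 0

print(burrow_status([100, 50, 0], [800, 900, 1000]))    # → OK
print(burrow_status([100, 150, 200], [800, 810, 820]))  # → ERROR
print(burrow_status([200, 200, 200], [800, 800, 800]))  # → STALL
print(burrow_status([100, 600], [1000, 500]))           # → REWIND
```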
Highload Best Practices
✅ Lag per partition monitoring (not only total)
✅ Lag growth rate monitoring (deriv/lag rate)
✅ Lag + throughput + producer rate together
✅ Burrow status monitoring (OK/WARN/ERROR/STOP/STALL)
✅ Alert thresholds by severity (warning, critical)
✅ Runbook for lag investigation
✅ Auto-scaling on lag thresholds (KEDA, custom HPA)
✅ Historical lag data (trend analysis)
✅ Multi-cluster lag (replication lag)
✅ Transactional message awareness (LastStableOffset)
❌ Lag without context (throughput, producer rate)
❌ Lag without alerting (useless metric)
❌ Total lag without per-partition breakdown
❌ Lag monitoring without a runbook
❌ Lag without historical data (no trend analysis)
❌ Lag on production without auto-scaling plan
❌ Lag monitoring without consumer status (dead consumer, lag=0)
Auto-scaling Based on Lag (KEDA)
# KEDA ScaledObject for auto-scaling Kafka consumers
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: kafka-consumer-scaler
spec:
scaleTargetRef:
name: order-consumer-deployment
minReplicaCount: 3
maxReplicaCount: 20
triggers:
- type: kafka
metadata:
bootstrapServers: kafka:9092
consumerGroup: order-service
topic: orders
lagThreshold: "5000" # scale up when lag > 5000
activationLagThreshold: "1000" # activate when lag > 1000
offsetResetPolicy: "latest"
allowIdleConsumers: "false"
Architectural Decisions
- Lag is a leading indicator — grows before users are impacted
- Lag + throughput + error rate — three pillars of consumer health
- Burrow status > lag number — status trend is more actionable than absolute lag
- Auto-scaling on lag — proactive scaling, not reactive
- Lag per partition — partition-level visibility is critical
Summary for Senior
- Lag = LogEndOffset - CommittedOffset, with nuances (in-flight, transactional)
- Burrow status (OK/WARN/ERROR/STOP/STALL/REWIND) is more informative than lag value
- Lag growth rate is more actionable than absolute lag
- Lag spike during rebalancing is normal, don’t alert
- Lag = 0 doesn’t mean consumer is healthy (check heartbeat, throughput)
- Transactional messages — use LastStableOffset for lag calculation
- Auto-scaling on lag thresholds — KEDA for Kubernetes
- Production runbook for lag investigation — mandatory
- Lag per partition — total lag hides partition-level issues
- Multi-cluster: consumer lag + replication lag + producer lag — all three are needed
🎯 Interview Cheat Sheet
Must know:
- Consumer Lag = LogEndOffset - CommittedOffset — key consumer health metric
- Lag per partition is critical: total lag hides problematic partitions (hot key)
- Burrow status: OK, WARN, ERROR, STOP, STALL, REWIND — more informative than absolute lag
- Lag growth rate (deriv) — more actionable than absolute lag value
- Lag = 0 doesn’t mean healthy: consumer may be stopped, no new messages
- Tools: kafka-consumer-groups.sh, Burrow, Prometheus + kafka_exporter, Grafana
- Auto-scaling on lag thresholds: KEDA for Kubernetes
Common follow-up questions:
- Lag=0 — is this always OK? — No, consumer may be dead; check heartbeat and throughput.
- What does REWIND mean in Burrow? — Offset moved backwards (replay, reset, or data loss).
- Lag spike during rebalancing — is this a problem? — No, expected behavior; ignore < 5 min duration.
- How to identify a hot partition? — Lag imbalance across partitions: one lags, others are OK.
Red flags (DO NOT say):
- “Lag=0 means everything is fine” — consumer may be stopped
- “Only total lag is enough” — hides partition-level problems
- “Lag spike during rebalance is a critical problem” — normal behavior
- “Burrow is not needed, CLI is enough” — CLI = point-in-time, Burrow = trend analysis
Related topics:
- [[12. What is offset in Kafka]]
- [[13. How does offset commit work]]
- [[24. How to handle errors when reading messages]]
- [[25. What is DLQ (Dead Letter Queue)]]