Question 26 · Section 15

How to monitor consumer lag in Kafka

Junior Level

Simple Definition

Consumer lag is the difference between the latest message in a Kafka partition and the last message the consumer processed. It shows how far behind the consumer is.

Analogy

Imagine a checkout line at a store. The cashier (consumer) has served 80 customers, and there are still 200 waiting in line. Lag = 200: the number of customers still waiting. If the lag keeps growing, the cashier can't keep up and you need to open another register.

Visualization

Partition 0:
  Log End Offset (latest in Kafka):  1000
  Committed Offset (processed):       800
  ────────────────────────────────────────
  Consumer Lag:                        200

  [0......100......200......800|←200→|1000]
  |----- processed -----| lag |

How to Check Lag

# Kafka CLI utility
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service

GROUP           TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service   orders   0          800             1000            200
order-service   orders   1          750             900             150
order-service   orders   2          900             950              50

Total lag: 400 messages

What Lag Means

| Lag | What It Means | What to Do |
| --- | ------------- | ---------- |
| 0 | Consumer processed everything | Nothing, all good |
| 1–100 | Normal backlog | Monitor |
| 100–1000 | Consumer is slowing down | Check throughput |
| 1000+ | Consumer can't keep up | Scale up |
| Growing fast | Critical problem | Alert + investigate |

(These thresholds depend on throughput: lag must be evaluated together with the producer rate. A lag of 1000 at 100 msg/s means the consumer is 10 seconds behind; at 1M msg/s it is negligible.)

When This Matters

  • Production monitoring — always monitor lag
  • Scaling decisions — when to add consumers
  • Health checks — lag is a health indicator
  • SLA monitoring — guarantee processing time

Middle Level

How Lag Calculation Works

Lag = LogEndOffset - CommittedOffset

LogEndOffset:
  - The last offset written to the partition
  - Updated on every producer send
  - Stored on the broker

CommittedOffset:
  - The last offset committed by the consumer
  - Stored in __consumer_offsets topic
  - Updated on consumer.commitSync/Async

Lag calculation:
  - Computed by external tools (not Kafka itself)
  - Kafka broker provides LogEndOffset
  - Monitoring tool reads committed offsets
  - Lag = the difference
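In code, this is just a per-partition subtraction. A minimal Python sketch of what a monitoring tool computes, with offsets hardcoded from the earlier `kafka-consumer-groups.sh` example (a real tool fetches them via the Kafka AdminClient):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: LogEndOffset - CommittedOffset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Offsets from the earlier CLI output
log_end = {0: 1000, 1: 900, 2: 950}
committed = {0: 800, 1: 750, 2: 900}

lag = consumer_lag(log_end, committed)   # {0: 200, 1: 150, 2: 50}
total_lag = sum(lag.values())            # 400
```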

Monitoring Tools

1. Kafka CLI

# All consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Details of a specific group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service

# Reset offsets (careful!)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-service --reset-offsets --to-latest --execute

2. Burrow (LinkedIn)

# Burrow HTTP API
GET http://burrow-server:8000/v3/kafka/cluster/consumer/order-service/status

Response:
{
  "error": false,
  "message": "consumer status returned",
  "status": {
    "cluster": "production",
    "group": "order-service",
    "status": "OK",
    "complete": 1.0,
    "partitions": [
      {
        "topic": "orders",
        "partition": 0,
        "status": "OK",
        "start": {"offset": 700, "timestamp": 1712000000000},
        "end": {"offset": 1000, "timestamp": 1712000100000},
        "current_lag": 200,
        "max_lag": 300
      }
    ],
    "partition_count": 3
  }
}

Burrow status values:

| Status | Meaning |
| ------ | ------- |
| OK | Lag is decreasing or 0 |
| WARN | Lag is stable but high |
| ERROR | Lag is growing |
| STOP | Consumer is not committing |
| STALL | Lag is not changing |
| REWIND | Offset moved backwards |

3. Prometheus + kafka_exporter

# docker-compose.yml
services:
  kafka-exporter:
    image: danielqsj/kafka-exporter
    command:
      - '--kafka.server=kafka:9092'
    ports:
      - "9308:9308"

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka-exporter:9308']

4. Grafana Dashboard

Key panels:

  • Consumer lag per partition (time series)
  • Total lag across partitions (single stat)
  • Lag rate (derivative — lag growing speed)
  • Consumer throughput (messages/sec)
  • Producer vs Consumer throughput comparison
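The lag-rate panel is the discrete analogue of Prometheus's `deriv()`. A minimal sketch of the same calculation over `(timestamp, lag)` samples (endpoint slope rather than the least-squares fit `deriv()` actually uses):

```python
def lag_rate(samples):
    """Lag growth rate in messages/sec from (timestamp_sec, lag) samples."""
    (t0, l0), (t1, l1) = samples[0], samples[-1]
    return (l1 - l0) / (t1 - t0)

# Lag grew from 100 to 700 over 15 minutes: 600 msgs / 900 s
samples = [(0, 100), (300, 300), (600, 500), (900, 700)]
rate = lag_rate(samples)   # ~0.67 msg/s, i.e. ~40 msg/min
```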

Common Errors Table

| Error | Symptoms | Consequences | Solution |
| ----- | -------- | ------------ | -------- |
| Monitoring only total lag | Problematic partitions not visible | One partition lags while others are OK; the problem stays hidden | Lag per partition |
| Lag without context | Lag = 500: is this good or bad? | Wrong decisions | Lag + throughput + trend |
| No alerting | Lag is growing and nobody sees it | Consumer can't keep up, data is delayed | Alert thresholds |
| Assuming lag = 0 means OK | Lag = 0 but the consumer is stopped | False sense of security | Monitor consumer status |
| Only current lag | No history | Trend analysis is impossible | Store lag history |
| Monitoring on the producer side | Lag is a consumer metric, not a producer one | Irrelevant data | Monitor consumer lag |

Usage Scenarios

Scenario 1: Lag grows linearly

Time     Lag
10:00    100
10:05    300
10:10    500
10:15    700
→ Linear growth = consumer slower than producer

Root cause:
  - Consumer processing slow (DB bottleneck, external API)
  - GC pauses
  - Network latency to broker

Action: Scale consumers or optimize processing

Scenario 2: Lag spike then recovery

Time     Lag
10:00    100
10:05    5000   ← spike
10:10    3000
10:15    500    ← recovery
10:20    100

Root cause:
  - Temporary broker unavailability
  - Network partition resolved
  - GC pause ended

Action: Investigate spike cause, may not need scaling

Scenario 3: Lag grows exponentially

Time     Lag
10:00    100
10:05    200
10:10    800
10:15    3200
10:20    12800
→ Exponential = consumer completely stuck

Root cause:
  - Poison pill message
  - Deadlock in consumer
  - External dependency down

Action: URGENT — find root cause, may need to skip messages
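The three shapes above can be distinguished mechanically from the deltas between samples. A rough, illustrative classifier (the 2x and 5x thresholds are arbitrary choices for this sketch, not standard values):

```python
def classify_lag_trend(lags):
    """Rough classification of equally spaced lag samples."""
    deltas = [b - a for a, b in zip(lags, lags[1:])]
    if all(d <= 0 for d in deltas):
        return "recovering"
    # Each increment at least doubles: growth is accelerating, consumer likely stuck
    if len(deltas) >= 3 and all(b >= 2 * a for a, b in zip(deltas, deltas[1:])):
        return "exponential"
    # Peak far above the final value: a spike that has already recovered
    if max(lags) >= 5 * lags[-1]:
        return "spike-recovered"
    return "linear"

assert classify_lag_trend([100, 300, 500, 700]) == "linear"                   # Scenario 1
assert classify_lag_trend([100, 5000, 3000, 500, 100]) == "spike-recovered"   # Scenario 2
assert classify_lag_trend([100, 200, 800, 3200, 12800]) == "exponential"      # Scenario 3
```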

When NOT to Use Lag as the Only Metric

  • Lag = 0 but consumer is processing slowly — may spike on burst
  • Lag is high but decreasing — consumer is catching up, don’t panic
  • Lag without producer rate — lag may be high with low producer activity
  • Lag without error rate — lag may grow due to errors, not slow processing

Senior Level

Deep Internals

Lag Calculation Internals

Kafka broker:
  Partition State:
    - LogEndOffset: offset the next message will get
    - HighWatermark: offset of the last committed message (all ISR)
    - LastStableOffset: for transactional messages

Consumer:
  CommittedOffset: stored in __consumer_offsets topic
  Position: next offset to be read (may be ahead of committed)

Lag (per partition) = LogEndOffset - CommittedOffset

__consumer_offsets topic:
  - Internal Kafka topic
  - Compacted cleanup (only latest offset per group+partition)
  - 50 partitions (default)
  - Key: [group.id, topic, partition]
  - Value: {offset, metadata, leaderEpoch, timestamp}

Offset Commit Lag vs Processing Lag

Offset Commit Lag:
  CommittedOffset = 800
  ProcessingOffset = 850  (processed but not committed)
  LogEndOffset = 1000

  "Visible lag" = 1000 - 800 = 200
  "Real lag" = 1000 - 850 = 150

  Difference = 50 messages in-flight (processed, not committed)

This matters for monitoring:
  - Burrow sees committed offset → lag = 200
  - Real processing lag = 150
  - In-flight messages are not visible to external tools

Kafka Exporter Internals

kafka_exporter (Prometheus):
  1. AdminClient.ListConsumerGroupOffsets() → committed offsets
  2. AdminClient.ListOffsets(latest) → log end offsets
  3. Calculates lag
  4. Exposes as Prometheus metrics

Metrics exposed:
  kafka_consumergroup_lag{group="order-service",topic="orders",partition="0"}
  kafka_consumergroup_lag_sum{group="order-service",topic="orders"}
  kafka_topic_partition_current_offset{topic="orders",partition="0"}
  kafka_consumergroup_current_offset{group="order-service",topic="orders",partition="0"}

Scrape interval: 10–30s (not more often — expensive admin calls)
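The exporter's cycle can be sketched with a stubbed admin client standing in for the real AdminClient calls (`StubAdmin` and its method names are illustrative, preloaded with the offsets from the earlier CLI example, not a real client API):

```python
class StubAdmin:
    """Stand-in for the Kafka AdminClient, preloaded with example offsets."""
    def list_consumer_group_offsets(self, group):
        return {("orders", 0): 800, ("orders", 1): 750, ("orders", 2): 900}
    def list_latest_offsets(self, topic_partitions):
        latest = {("orders", 0): 1000, ("orders", 1): 900, ("orders", 2): 950}
        return {tp: latest[tp] for tp in topic_partitions}

def scrape_lag(admin, group):
    """One scrape cycle: committed offsets -> latest offsets -> lag per partition."""
    committed = admin.list_consumer_group_offsets(group)
    latest = admin.list_latest_offsets(list(committed))
    return {tp: latest[tp] - committed[tp] for tp in committed}

metrics = scrape_lag(StubAdmin(), "order-service")
# {("orders", 0): 200, ("orders", 1): 150, ("orders", 2): 50}
```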

Trade-offs

| Approach | Accuracy | Overhead | Real-time | Complexity |
| -------- | -------- | -------- | --------- | ---------- |
| CLI (kafka-consumer-groups) | High | Manual | Point-in-time | Low |
| Burrow | High | Low | ~5s delay | Medium |
| kafka_exporter + Prometheus | Medium | Low | ~10s delay | Medium |
| JMX (kafka.consumer) | High | Medium | Real-time | High |
| Custom monitoring | Maximum | Varies | Real-time | High |

Edge Cases

1. Lag = 0 but consumer is not working:

Scenario:
  Consumer processed up to offset 1000
  No new messages produced for the last 10 min
  Lag = LogEndOffset(1000) - CommittedOffset(1000) = 0

  Looks OK, but the consumer may be dead!

Resolution:
  1. Monitor consumer heartbeat (session.timeout.ms)
  2. Monitor processing rate (messages/sec)
  3. Monitor "time since last commit"
  4. Burrow status: STOP if no heartbeat
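Points 1–3 combine into a simple health check: treat a consumer as unhealthy when it has not committed recently, regardless of lag. A sketch (the status names and the 120 s default are illustrative, loosely modeled on Burrow's STOP):

```python
import time

def consumer_health(lag, last_commit_ts, now=None, max_commit_age=120.0):
    """Lag alone is not enough: also check time since the last commit."""
    now = time.time() if now is None else now
    if now - last_commit_ts > max_commit_age:
        return "STOP"   # not committing: possibly dead, even if lag == 0
    return "OK" if lag == 0 else "LAGGING"

# lag == 0 but no commit for 10 minutes -> STOP, not OK
assert consumer_health(0, last_commit_ts=0, now=600) == "STOP"
assert consumer_health(0, last_commit_ts=590, now=600) == "OK"
```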

2. Lag spike during rebalancing:

Scenario:
  10:00: Consumer-1 processes partition 0, lag = 100
  10:01: Consumer-2 joins → rebalancing
  10:02: Partition 0 assigned to Consumer-2
  10:02: Lag spike: 100 → 500 (no processing during rebalance)
  10:05: Consumer-2 catches up, lag = 50

This is expected behavior!
  - Lag spike during rebalance = normal
  - Lag recovery after rebalance = healthy

Monitoring:
  Ignore lag spikes < 5 min duration
  Alert on lag spike > 10 min without recovery
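The "ignore short spikes" rule is the same idea as the `for:` clause in Prometheus alerts: require the condition to hold for a sustained window before firing. A sketch over equally spaced samples:

```python
def should_alert(lag_samples, threshold, min_duration, interval):
    """Alert only if lag stays above threshold for min_duration seconds
    (samples are `interval` seconds apart); short rebalance spikes pass."""
    needed = max(1, int(min_duration / interval))
    recent = lag_samples[-needed:]
    return len(recent) >= needed and all(l > threshold for l in recent)

# One 5-minute spike during a rebalance: suppressed
assert not should_alert([100, 500, 100], threshold=300, min_duration=600, interval=300)
# Sustained high lag: alert
assert should_alert([400, 500, 600], threshold=300, min_duration=600, interval=300)
```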

3. Consumer rewind (lag sudden increase):

Scenario:
  Committed offset: 1000
  Next poll: committed offset = 500 (rewind!)
  Lag suddenly: 500 → 1500

Root causes:
  1. Offset reset (auto.offset.reset = earliest)
  2. Manual seek (consumer.seek)
  3. Offset data loss (__consumer_offsets corruption)
  4. Consumer group reset

Burrow status: REWIND
Action:
  Investigate — rewind may be intentional (replay)
  Or data loss (critical!)
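Rewind detection itself is just a comparison of consecutive committed-offset snapshots. A minimal sketch:

```python
def detect_rewind(prev_committed, curr_committed):
    """Partitions whose committed offset moved backwards (Burrow's REWIND)."""
    return [p for p, off in curr_committed.items()
            if p in prev_committed and off < prev_committed[p]]

assert detect_rewind({0: 1000}, {0: 500}) == [0]   # rewind: 1000 -> 500
assert detect_rewind({0: 1000}, {0: 1200}) == []   # normal forward progress
```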

4. Transactional Messages and Lag:

Scenario:
  Producer started a transaction
  Wrote messages at offsets 1000-1050
  Transaction NOT committed yet
  LogEndOffset = 1051
  LastStableOffset = 1000

Consumer lag calculation:
  If consumer isolation.level = read_committed:
    Sees only committed messages up to offset 1000
    Lag = 1000 - committed_offset  (correct)
  If consumer isolation.level = read_uncommitted:
    Sees all messages including uncommitted
    Lag = 1051 - committed_offset  (misleading!)

Resolution: monitoring should use LastStableOffset for read_committed consumers
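A lag function that honors the isolation level, using the offsets from this scenario (the committed offset of 900 is an assumed example value):

```python
def consumer_lag_tx(log_end, last_stable, committed, isolation="read_committed"):
    """read_committed consumers can only advance to the LastStableOffset,
    so their lag should be measured against it, not LogEndOffset."""
    end = last_stable if isolation == "read_committed" else log_end
    return end - committed

assert consumer_lag_tx(1051, 1000, 900) == 100                                 # correct
assert consumer_lag_tx(1051, 1000, 900, isolation="read_uncommitted") == 151   # misleading
```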

5. Multi-cluster Lag:

Scenario:
  MirrorMaker 2 replicates: Cluster A → Cluster B
  Consumer reads from Cluster B

  Lag on Cluster B may grow due to:
  1. Consumer slow (normal lag)
  2. Replication lag (Cluster A → Cluster B slow)
  3. Producer slow on Cluster A

Need to monitor:
  - Consumer lag on Cluster B
  - Replication lag (A → B)
  - Producer lag on Cluster A

Only then can you determine the root cause

Performance (Production Numbers)

| Metric | Normal | Warning | Critical |
| ------ | ------ | ------- | -------- |
| Lag per partition | < 1000 | 1000–10000 | > 10000 |
| Lag growth rate | < 100/min | 100–1000/min | > 1000/min |
| Consumer throughput | > 90% of producer | 50–90% | < 50% |
| Commit latency (P99) | < 50 ms | 50–200 ms | > 200 ms |
| Time to recover lag | < 5 min | 5–30 min | > 30 min |

Production War Story

Situation: Delivery platform, 500K events/min, 20 partitions, 10 consumers.

Problem: At 2 AM, lag started growing: 1000 → 50000 in 30 minutes. By 3 AM — 200000 lag. On-call engineer received an alert and started investigation. By 4 AM — 500000 lag, 1 hour backlog.

Investigation timeline:

02:00 — Alert: lag > 10000
02:05 — On-call checks Grafana: lag growing exponentially
02:10 — Consumer throughput dropped: 50K → 5K msg/min
02:15 — No consumer restarts, no rebalancing
02:20 — DB connection pool exhausted (root cause!)
02:25 — DB team: migration running, connection limit reached
02:30 — Migration stopped, connections freed
02:35 — Consumers catch up
03:30 — Lag back to 0

Root cause: DB team ran a migration without notification. The migration used all connections. Consumers couldn’t write to the DB → processing stopped → lag grew.

Lessons:

  1. Lag alert fired — GOOD
  2. But lag didn’t say WHY — DB connection monitoring was needed
  3. Cross-team communication process needed
  4. Lag growth rate alert (> 10K/min) — more actionable
  5. Runbook for lag investigation — critical

Post-mortem improvements:

Alerts added:
  - lag > 10K for 5m → warning
  - lag > 100K for 2m → critical
  - lag growth rate > 10K/min → critical
  - consumer throughput < 50% of producer → warning
  - DB connection pool > 90% → critical

Runbook created:
  1. Check lag per partition (is it all or one?)
  2. Check consumer throughput vs producer rate
  3. Check consumer logs for errors
  4. Check external dependencies (DB, API)
  5. Check recent deployments

Monitoring (JMX, Prometheus, Burrow)

JMX Metrics

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1
  - records-lag-max: max lag across all partitions
  - records-lag-avg: average lag
  - fetch-latency-avg: time to fetch from broker
  - fetch-rate: fetch requests per second

kafka.consumer:type=consumer-coordinator-metrics,client-id=consumer-1
  - commit-latency-avg
  - commit-rate
  - last-heartbeat-seconds-ago

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  (if partitions are under-replicated → consumer may see stale data)

Prometheus + Grafana Dashboard

# Recording rules
- record: kafka_consumer_lag_rate
  expr: deriv(kafka_consumergroup_lag[5m])

- record: kafka_consumer_throughput
  expr: rate(kafka_consumergroup_current_offset[5m])

- record: kafka_producer_throughput
  expr: rate(kafka_topic_partition_current_offset[5m])

# Alerts
- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag_sum > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag is high (> 10K)"

- alert: KafkaConsumerLagGrowing
  expr: kafka_consumer_lag_rate > 1000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Consumer lag is growing at {{ $value }}/min"

- alert: KafkaConsumerThroughputLow
  expr: kafka_consumer_throughput / kafka_producer_throughput < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Consumer throughput < 50% of producer throughput"

Burrow Configuration

# burrow.ini
[consumer.order-service]
class-name=kafka
cluster=production
group=order-service

# Burrow evaluation
[lagcheck.order-service]
group-order-service.orders.0=OK
group-order-service.orders.1=OK
group-order-service.orders.2=OK

# Notification
[notifier.default]
class-name=email
threshold=1
send-close=true
template=lag-email.tmpl

# HTTP endpoint
[httpserver.default]
address=:8000

Burrow evaluation statuses:

OK:      Lag decreasing or 0
WARN:    Lag stable but above threshold
ERROR:   Lag increasing
STOP:    Consumer stopped committing
STALL:   Lag not changing (consumer stuck)
REWIND:  Offset moved backwards

Highload Best Practices

✅ Lag per partition monitoring (not only total)
✅ Lag growth rate monitoring (deriv/lag rate)
✅ Lag + throughput + producer rate together
✅ Burrow status monitoring (OK/WARN/ERROR/STOP/STALL)
✅ Alert thresholds by severity (warning, critical)
✅ Runbook for lag investigation
✅ Auto-scaling on lag thresholds (KEDA, custom HPA)
✅ Historical lag data (trend analysis)
✅ Multi-cluster lag (replication lag)
✅ Transactional message awareness (LastStableOffset)

❌ Lag without context (throughput, producer rate)
❌ Lag without alerting (useless metric)
❌ Total lag without per-partition breakdown
❌ Lag monitoring without a runbook
❌ Lag without historical data (no trend analysis)
❌ Lag on production without auto-scaling plan
❌ Lag monitoring without consumer status (dead consumer, lag=0)

Auto-scaling Based on Lag (KEDA)

# KEDA ScaledObject for auto-scaling Kafka consumers
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: order-consumer-deployment
  minReplicaCount: 3
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: order-service
        topic: orders
        lagThreshold: "5000"    # scale up when lag > 5000
        activationLagThreshold: "1000"  # activate when lag > 1000
        offsetResetPolicy: "latest"
        allowIdleConsumers: "false"

Architectural Decisions

  1. Lag is a leading indicator — grows before users are impacted
  2. Lag + throughput + error rate — three pillars of consumer health
  3. Burrow status > lag number — status trend is more actionable than absolute lag
  4. Auto-scaling on lag — proactive scaling, not reactive
  5. Lag per partition — partition-level visibility is critical

Summary for Senior

  • Lag = LogEndOffset - CommittedOffset, with nuances (in-flight, transactional)
  • Burrow status (OK/WARN/ERROR/STOP/STALL/REWIND) is more informative than lag value
  • Lag growth rate is more actionable than absolute lag
  • Lag spike during rebalancing is normal, don’t alert
  • Lag = 0 doesn’t mean consumer is healthy (check heartbeat, throughput)
  • Transactional messages — use LastStableOffset for lag calculation
  • Auto-scaling on lag thresholds — KEDA for Kubernetes
  • Production runbook for lag investigation — mandatory
  • Lag per partition — total lag hides partition-level issues
  • Multi-cluster: consumer lag + replication lag + producer lag — all three are needed

🎯 Interview Cheat Sheet

Must know:

  • Consumer Lag = LogEndOffset - CommittedOffset — key consumer health metric
  • Lag per partition is critical: total lag hides problematic partitions (hot key)
  • Burrow status: OK, WARN, ERROR, STOP, STALL, REWIND — more informative than absolute lag
  • Lag growth rate (deriv) — more actionable than absolute lag value
  • Lag = 0 doesn’t mean healthy: consumer may be stopped, no new messages
  • Tools: kafka-consumer-groups.sh, Burrow, Prometheus + kafka_exporter, Grafana
  • Auto-scaling on lag thresholds: KEDA for Kubernetes

Common follow-up questions:

  • Lag=0 — is this always OK? — No, consumer may be dead; check heartbeat and throughput.
  • What does REWIND mean in Burrow? — Offset moved backwards (replay, reset, or data loss).
  • Lag spike during rebalancing — is this a problem? — No, expected behavior; ignore < 5 min duration.
  • How to identify a hot partition? — Lag imbalance across partitions: one lags, others are OK.

Red flags (DO NOT say):

  • “Lag=0 means everything is fine” — consumer may be stopped
  • “Only total lag is enough” — hides partition-level problems
  • “Lag spike during rebalance is a critical problem” — normal behavior
  • “Burrow is not needed, CLI is enough” — CLI = point-in-time, Burrow = trend analysis

Related topics:

  • [[12. What is offset in Kafka]]
  • [[13. How does offset commit work]]
  • [[24. How to handle errors when reading messages]]
  • [[25. What is DLQ (Dead Letter Queue)]]