What is rebalancing and when does it happen
Junior Level
Definition
Rebalancing is the process of redistributing partitions among consumers in a group.
Cost: during a rebalance, consumers do NOT read messages. With eager rebalancing the whole group stops for 5-30 seconds. During this time lag grows, and under a cascading failure it snowballs.
Before rebalance:
C1 → P0, P1
C2 → P2, P3
After adding C3:
C1 → P0, P1
C2 → P2
C3 → P3
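The redistribution above can be sketched in plain Java. This is a toy range-style assignor (the helper name `rangeAssign` is hypothetical, not Kafka's internal API): earlier consumers get one extra partition when the count doesn't divide evenly, which reproduces the diagram.

```java
import java.util.*;

// Toy illustration of how a range-style assignor spreads partitions across
// consumers. Hypothetical sketch, not Kafka's actual RangeAssignor.
public class RangeAssignDemo {
    // Assign numPartitions partitions to consumers in order; earlier
    // consumers receive one extra partition on uneven division.
    static Map<String, List<Integer>> rangeAssign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int base = numPartitions / consumers.size();
        int extra = numPartitions % consumers.size();
        int next = 0;
        for (int i = 0; i < consumers.size(); i++) {
            int count = base + (i < extra ? 1 : 0);
            List<Integer> parts = new ArrayList<>();
            for (int j = 0; j < count; j++) parts.add(next++);
            result.put(consumers.get(i), parts);
        }
        return result;
    }

    public static void main(String[] args) {
        // Before: two consumers share four partitions.
        System.out.println(rangeAssign(List.of("C1", "C2"), 4));       // {C1=[0, 1], C2=[2, 3]}
        // After C3 joins: the same partitions are spread over three.
        System.out.println(rangeAssign(List.of("C1", "C2", "C3"), 4)); // {C1=[0, 1], C2=[2], C3=[3]}
    }
}
```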
When does rebalancing occur?
1. A new consumer joins the group
2. A consumer crashes or disconnects
3. New partitions are added to a topic
4. A consumer changes its topic subscription
What happens during rebalance?
1. All consumers stop reading
2. Kafka redistributes partitions
3. Each consumer gets new partitions
4. Reading resumes
Important: With eager rebalancing — full stop. With cooperative (Kafka 2.3+) — processing continues on remaining partitions, only moved partitions experience downtime.
Code example
// Consumer automatically participates in rebalance
props.put("group.id", "my-group");
consumer.subscribe(List.of("orders"));
// When a new consumer with the same group.id is added
// an automatic rebalance will occur
Middle Level
Rebalancing process (step by step)
1. Trigger — event (new consumer, timeout, etc.)
2. Group coordinator starts rebalance
3. All consumers revoke their partitions
4. New assignment is computed
5. Consumers receive new partitions
6. Reading resumes
Types of rebalance
Eager Rebalancing (old):
All consumers stop completely
Full partition redistribution
All consumers resume work
→ Full group downtime
Cooperative Rebalancing (Kafka 2.3+):
Gradual partition movement
Consumers continue working with remaining partitions
→ Minimal downtime
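The difference in downtime can be made concrete by counting how many partitions actually pause. This is a toy comparison with hypothetical helper names, not Kafka API: eager revokes everything up front, while cooperative only pauses partitions whose owner changes.

```java
import java.util.*;

// Toy comparison of how many partitions stop processing during a rebalance.
// Eager: every partition is revoked first. Cooperative: only partitions that
// change owner are paused. Illustrative sketch, not Kafka internals.
public class RebalanceCostDemo {
    static int revokedEager(Map<Integer, String> oldOwners) {
        return oldOwners.size(); // eager revokes everything before reassigning
    }

    static int revokedCooperative(Map<Integer, String> oldOwners, Map<Integer, String> newOwners) {
        int moved = 0;
        for (Map.Entry<Integer, String> e : newOwners.entrySet()) {
            if (!e.getValue().equals(oldOwners.get(e.getKey()))) moved++;
        }
        return moved; // only partitions that changed owner pause
    }

    public static void main(String[] args) {
        // C3 joins: only P3 moves from C2 to C3.
        Map<Integer, String> before = Map.of(0, "C1", 1, "C1", 2, "C2", 3, "C2");
        Map<Integer, String> after  = Map.of(0, "C1", 1, "C1", 2, "C2", 3, "C3");
        System.out.println("eager pauses: " + revokedEager(before));                    // 4
        System.out.println("cooperative pauses: " + revokedCooperative(before, after)); // 1
    }
}
```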
ConsumerRebalanceListener
consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// Commit offsets before losing partitions
consumer.commitSync();
// Clean up local resources
cleanup(partitions);
}
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// Initialize new partitions
initialize(partitions);
}
});
Max Poll Interval Exceeded
// If batch processing takes too long
props.put("max.poll.interval.ms", "300000"); // 5 minutes
// If consumer doesn't call poll() within this time
// → considered dead → excluded from group → rebalance
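The deadline logic behind this rule can be mimicked in a few lines of plain Java. This is a toy model (the method name is hypothetical, not broker code): if the gap since the last poll() exceeds the configured interval, the consumer is treated as dead.

```java
// Toy model of the max.poll.interval.ms check: if the time since the last
// poll() exceeds the configured interval, the consumer is considered dead
// and removed from the group. Plain-Java sketch, not broker internals.
public class PollIntervalDemo {
    static boolean exceededMaxPollInterval(long lastPollMs, long nowMs, long maxPollIntervalMs) {
        return nowMs - lastPollMs > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // 5 minutes, the default
        // A batch took 6 minutes without an intervening poll():
        System.out.println(exceededMaxPollInterval(0, 360_000, maxPollIntervalMs)); // true → rebalance
        // A batch finished in 4 minutes:
        System.out.println(exceededMaxPollInterval(0, 240_000, maxPollIntervalMs)); // false
    }
}
```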
Common mistakes
- Frequent rebalances: max poll interval exceeded → consumer excluded → rebalance → downtime → performance loss
- No ConsumerRebalanceListener: data loss on rebalance (offsets not committed)
- Too short timeout: session.timeout.ms=5000 → false positives → consumer considered dead → unnecessary rebalance
Senior Level
Internal Implementation
Group Coordinator:
Coordinator broker stores group state:
- Member list (list of consumers)
- Partition assignments
- Committed offsets (in __consumer_offsets)
Rebalancing protocol:
Glossary:
Group Coordinator — broker managing group membership.
JoinGroup — consumer says "I want to join the group".
SyncGroup — leader distributes the assignment result.
1. Join Phase:
- Consumer → JoinGroup request
- Coordinator → collects all members
- Coordinator → chooses group leader
2. Assignment Phase:
- Leader → computes assignment
- Leader → SyncGroup request
- Coordinator → distributes assignments
3. Heartbeat Phase:
- Consumers → Heartbeat requests
- Coordinator → confirms alive
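The join → sync flow above can be walked through as a toy simulation. The helper names are hypothetical and this is not the wire protocol, only the idea: the coordinator picks a leader (in practice, usually the first member to join), the leader computes the assignment, and each member then receives only its own share.

```java
import java.util.*;

// Toy walk-through of the JoinGroup/SyncGroup flow. Hypothetical sketch of
// the idea, not the actual Kafka wire protocol.
public class GroupProtocolDemo {
    // Join phase: the coordinator collects members and picks a leader.
    static String electLeader(List<String> members) {
        return members.get(0); // in practice, typically the first joiner
    }

    // Assignment phase: the leader computes who owns what (round-robin here)
    // and sends the result back in its SyncGroup request.
    static Map<String, List<Integer>> leaderAssign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String m : members) assignment.put(m, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(members.get(p % members.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> members = List.of("C1", "C2");
        System.out.println("leader: " + electLeader(members)); // leader: C1
        // The coordinator distributes each member's slice via SyncGroup responses.
        System.out.println(leaderAssign(members, 4));          // {C1=[0, 2], C2=[1, 3]}
    }
}
```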
Cooperative Rebalancing — in detail
Incremental Cooperative Rebalancing:
Instead of full stop:
1. Revoke some partitions
2. Assign new partitions
3. Continue working
4. Repeat if needed
Advantages:
- Minimal downtime
- Preserve data locality
- Better availability
Configuration:
props.put("partition.assignment.strategy",
"org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
Tuning Parameters
# Session management
session.timeout.ms: 30000 # 30 seconds until detection
heartbeat.interval.ms: 10000 # 10 seconds (1/3 of session timeout)
# Poll management
max.poll.interval.ms: 300000 # 5 minutes for processing
max.poll.records: 500 # batch for processing
# Assignment strategy
partition.assignment.strategy: cooperative-sticky
Failure Scenarios
1. Network Partition:
Consumer loses connection to broker
→ Heartbeat timeout → session timeout → rebalance
→ When network recovers → consumer reconnect → join group
2. GC Pause:
Example: session.timeout.ms=30000, GC pause=35 seconds.
Long GC pause → consumer doesn't send a heartbeat for 35 seconds
→ Broker waits 30 seconds, considers the consumer dead → rebalance
→ After GC, the consumer learns about the rebalance, rejoins the group → gets new partitions
3. Slow Processing:
Processing batch > max.poll.interval.ms
→ Broker considers consumer dead → rebalance
→ Consumer continues → receives new assignment
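The failure scenarios above reduce to two timeouts. This toy classifier (hypothetical names, illustrative only) shows which rule fires in each case: a heartbeat gap beyond session.timeout.ms, or a poll gap beyond max.poll.interval.ms.

```java
// Toy classifier for the failure scenarios above: missing heartbeats beyond
// session.timeout.ms, or a poll gap beyond max.poll.interval.ms, each gets
// the consumer kicked out and triggers a rebalance. Illustrative sketch only.
public class FailureScenarioDemo {
    static String diagnose(long heartbeatGapMs, long pollGapMs,
                           long sessionTimeoutMs, long maxPollIntervalMs) {
        if (heartbeatGapMs > sessionTimeoutMs) return "session timeout -> rebalance";
        if (pollGapMs > maxPollIntervalMs)     return "max poll interval exceeded -> rebalance";
        return "alive";
    }

    public static void main(String[] args) {
        // Scenario 2: a 35s GC pause against session.timeout.ms=30000
        System.out.println(diagnose(35_000, 35_000, 30_000, 300_000)); // session timeout -> rebalance
        // Scenario 3: heartbeats fine, but a 6-minute batch without poll()
        System.out.println(diagnose(5_000, 360_000, 30_000, 300_000)); // max poll interval exceeded -> rebalance
        // Healthy consumer
        System.out.println(diagnose(5_000, 10_000, 30_000, 300_000));  // alive
    }
}
```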
Monitoring Rebalancing
Key metrics (from the consumer's kafka.consumer:type=consumer-coordinator-metrics MBean):
rebalance-total / rebalance-rate-per-hour
last-rebalance-seconds-ago
assigned-partitions
commit-latency-avg
Alerts:
- Rebalance more often than every 10 minutes → warning
- Rebalance more often than every 5 minutes → critical
- Consumer hasn't sent heartbeat > 30s → critical
- Consumer lag growing → warning
Troubleshooting
1. Identify the cause:
// Logs show:
"Heartbeat session expired" → session.timeout.ms too short
"Max poll interval exceeded" → processing too slow
"Member is dead" → consumer crashed
2. Resolving issues:
Heartbeat issues:
→ Increase session.timeout.ms
→ Check network connectivity
Poll interval issues:
→ Reduce max.poll.records
→ Optimize processing
→ Increase max.poll.interval.ms
Best Practices
✅ Cooperative rebalancing
✅ Sufficient session timeout (30s+)
✅ Fast message processing
✅ ConsumerRebalanceListener for graceful handling
✅ Static membership for stable instances
✅ Monitor rebalance frequency
❌ Frequent rebalances without investigation
❌ Eager rebalancing for production
❌ Too short timeout
❌ Without onPartitionsRevoked handling
❌ Ignoring consumer lag
Architectural decisions
- Cooperative rebalancing — standard for production
- Static membership — for stable instances
- Proper tuning — balance between detection time and false positives
- Monitoring — early problem detection
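The static-membership decision above boils down to one setting: a stable group.instance.id per instance, so a restarted consumer rejoins under the same identity and a rolling deploy doesn't trigger a rebalance (as long as it returns within session.timeout.ms). A minimal sketch using plain string keys, so no Kafka dependency is needed; the group/instance names are made up.

```java
import java.util.Properties;

// Sketch of static membership configuration. Plain string keys are used so
// the example compiles without the kafka-clients jar; names are illustrative.
public class StaticMembershipConfig {
    static Properties forInstance(String instanceId) {
        Properties props = new Properties();
        props.put("group.id", "my-group");
        props.put("group.instance.id", instanceId); // stable per deployment slot
        props.put("session.timeout.ms", "45000");   // give restarts time to come back
        return props;
    }

    public static void main(String[] args) {
        // Each deployment slot keeps its identity across restarts.
        System.out.println(forInstance("orders-consumer-1").getProperty("group.instance.id"));
    }
}
```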
Summary for Senior
- Rebalancing — critical process for group availability
- Cooperative rebalancing minimizes disruption
- Group coordinator protocol: join → sync → heartbeat
- Parameter tuning is critical for stability
- Monitoring rebalance frequency — early problem indicator
🎯 Interview Cheat Sheet
Must know:
- Rebalancing — redistribution of partitions among consumers in a group
- Triggers: new consumer, crash, session timeout, adding partitions
- Eager rebalancing: full stop 5-30 seconds; Cooperative: 1-5 seconds, processing continues
- ConsumerRebalanceListener: commitSync in onPartitionsRevoked, cleanup state
- Max poll interval exceeded — common cause: consumer doesn’t call poll() in time
- Group Coordinator protocol: JoinGroup → Assignment (Leader) → SyncGroup → Heartbeat
- Static membership (group.instance.id) — eliminates rebalance on rolling deploy
Common follow-up questions:
- What is eager vs cooperative rebalancing? — Eager: all stop. Cooperative: only moved partitions.
- What happens with too short session.timeout? — False positives, unnecessary rebalances.
- How to avoid rebalance on GC pause? — Increase session.timeout.ms, static membership.
- Who computes the assignment? — The leader consumer (classic protocol); with the next-generation protocol (KIP-848, GA in Kafka 4.0) the broker-side Group Coordinator computes it.
Red flags (DO NOT say):
- “Rebalancing is fast and painless” — 5-30 seconds downtime (eager)
- “You can ignore frequent rebalances” — sign of timeout or processing problems
- “Eager rebalancing is best for production” — cooperative minimizes disruption
- “Rebalancing doesn’t affect processing” — full stop with eager
Related topics:
- [[5. What is Consumer Group]]
- [[6. How does consumer balancing work in a group]]
- [[8. What happens when you add a new consumer to the group]]
- [[14. What is the difference between auto commit and manual commit]]