Question 15 · Section 15

What is rebalancing and when does it happen?

Junior Level

Definition

Rebalancing is the process of redistributing partitions among consumers in a group.

The “cost”: during a rebalance, consumers do NOT read messages. With eager rebalancing — a full stop for 5-30 seconds. During this time lag grows, and a cascading failure makes it grow avalanche-style.

Before rebalance:
  C1 → P0, P1
  C2 → P2, P3

After adding C3:
  C1 → P0, P1
  C2 → P2
  C3 → P3

When does rebalancing occur?

1. A new consumer is added
2. A consumer crashed or disconnected
3. New partitions were added to a topic
4. A consumer changed its topic subscription

What happens during rebalance?

1. All consumers stop reading
2. Kafka redistributes partitions
3. Each consumer gets new partitions
4. Reading resumes

Important: With eager rebalancing — full stop. With cooperative (Kafka 2.4+) — processing continues on remaining partitions, only moved partitions experience downtime.

Code example

// Consumer automatically participates in rebalance
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("orders"));
// When another consumer joins with the same group.id,
// an automatic rebalance occurs

Middle Level

Rebalancing process (step by step)

1. Trigger — event (new consumer, timeout, etc.)
2. Group coordinator starts rebalance
3. All consumers revoke their partitions
4. New assignment is computed
5. Consumers receive new partitions
6. Reading resumes

Types of rebalance

Eager Rebalancing (old):

All consumers stop completely
Full partition redistribution
All consumers resume work
→ Full group downtime

Cooperative Rebalancing (Kafka 2.4+):

Gradual partition movement
Consumers continue working with remaining partitions
→ Minimal downtime
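The difference in cost can be illustrated with a toy calculation (plain Java, not Kafka's actual assignor code), using the junior-level example of C3 joining a group that reads P0-P3:

```java
import java.util.List;
import java.util.Map;

public class RebalanceCost {
    // Cooperative: only partitions whose owner changes are revoked
    static long cooperativeRevoked(Map<String, List<Integer>> before,
                                   Map<String, List<Integer>> after) {
        return before.entrySet().stream()
            .flatMap(e -> e.getValue().stream()
                .filter(p -> !after.getOrDefault(e.getKey(), List.of()).contains(p)))
            .count();
    }

    public static void main(String[] args) {
        // Assignments from the example above (sticky: C3 takes only P3)
        Map<String, List<Integer>> before =
            Map.of("C1", List.of(0, 1), "C2", List.of(2, 3));
        Map<String, List<Integer>> after =
            Map.of("C1", List.of(0, 1), "C2", List.of(2), "C3", List.of(3));

        // Eager: every consumer revokes everything before reassignment
        int eager = before.values().stream().mapToInt(List::size).sum();
        System.out.println("Eager revokes: " + eager);                                   // 4 — all pause
        System.out.println("Cooperative revokes: " + cooperativeRevoked(before, after)); // 1 — only P3 pauses
    }
}
```

With eager, all 4 partitions pause even though 3 of them end up with the same owner; cooperative pauses only the single partition that actually moves.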

ConsumerRebalanceListener

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit offsets before losing partitions
        consumer.commitSync();
        // Clean up local resources
        cleanup(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Initialize new partitions
        initialize(partitions);
    }
});

Max Poll Interval Exceeded

// If batch processing takes too long
props.put("max.poll.interval.ms", "300000");  // 5 minutes

// If consumer doesn't call poll() within this time
// → considered dead → excluded from group → rebalance
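If processing genuinely takes long, a common mitigation (a sketch — the exact values are assumptions and depend on your workload) is to shrink the batch rather than only raising the interval:

```java
import java.util.Properties;

public class PollTuning {
    static Properties pollTuning() {
        Properties props = new Properties();
        // Fewer records per poll() → each processing cycle finishes
        // well within max.poll.interval.ms
        props.put("max.poll.records", "100");        // default: 500
        props.put("max.poll.interval.ms", "300000"); // 5 minutes, as above
        return props;
    }

    public static void main(String[] args) {
        System.out.println(pollTuning());
    }
}
```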

Common mistakes

  1. Frequent rebalances:
    Max poll interval exceeded → consumer excluded → rebalance
    → Downtime → performance loss
    
  2. Without ConsumerRebalanceListener:
    Data loss on rebalance (offsets not committed)
    
  3. Too short timeout:
    session.timeout.ms=5000 → false positives
    → Consumer considered dead → unnecessary rebalance
    

Senior Level

Internal Implementation

Group Coordinator:

Coordinator broker stores group state:
- Member list (list of consumers)
- Partition assignments
- Committed offsets (in __consumer_offsets)

Rebalancing protocol:

Glossary:
  Group Coordinator — broker managing group membership.
  JoinGroup — consumer says "I want to join the group".
  SyncGroup — leader distributes the assignment result.

1. Join Phase:
   - Consumer → JoinGroup request
   - Coordinator → collects all members
   - Coordinator → chooses group leader

2. Assignment Phase:
   - Leader → computes assignment
   - Leader → SyncGroup request
   - Coordinator → distributes assignments

3. Heartbeat Phase:
   - Consumers → Heartbeat requests
   - Coordinator → confirms alive
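As an illustration only — a toy model in plain Java, not the real group protocol implementation — one round of the protocol can be sketched as:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupProtocolSketch {
    // Toy assignment the leader might compute in the assignment phase
    // (simple round-robin; real assignors are range/sticky/cooperative-sticky)
    static Map<String, List<Integer>> computeAssignment(List<String> members, int partitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        members.forEach(m -> assignment.put(m, new ArrayList<>()));
        for (int p = 0; p < partitions; p++)
            assignment.get(members.get(p % members.size())).add(p);
        return assignment;
    }

    public static void main(String[] args) {
        // 1. Join phase: consumers send JoinGroup; coordinator picks a leader
        List<String> members = List.of("C1", "C2", "C3");
        String leader = members.get(0);

        // 2. Assignment phase: leader computes the assignment and sends it in
        //    its SyncGroup request; coordinator forwards each member its share
        Map<String, List<Integer>> assignment = computeAssignment(members, 4);

        // 3. Heartbeat phase: members heartbeat until the next trigger
        System.out.println("Leader: " + leader + ", assignment: " + assignment);
    }
}
```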

Cooperative Rebalancing — in detail

Incremental Cooperative Rebalancing:

Instead of full stop:
1. Revoke some partitions
2. Assign new partitions
3. Continue working
4. Repeat if needed

Advantages:
- Minimal downtime
- Preserve data locality
- Better availability

Configuration:

props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

Tuning Parameters

# Session management
session.timeout.ms: 30000           # 30 seconds until detection
heartbeat.interval.ms: 10000        # 10 seconds (1/3 of session timeout)

# Poll management
max.poll.interval.ms: 300000        # 5 minutes for processing
max.poll.records: 500               # batch for processing

# Assignment strategy
partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
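The 1/3 rule of thumb above (heartbeat.interval.ms at most a third of session.timeout.ms, so several heartbeats can be missed before the session expires) can be sanity-checked in code — a sketch:

```java
public class TimeoutCheck {
    // Rule of thumb: heartbeat.interval.ms <= session.timeout.ms / 3,
    // so a consumer can miss a couple of heartbeats without being evicted
    static boolean heartbeatOk(int sessionTimeoutMs, int heartbeatIntervalMs) {
        return heartbeatIntervalMs <= sessionTimeoutMs / 3;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatOk(30000, 10000)); // true — matches the config above
        System.out.println(heartbeatOk(5000, 3000));   // false — too aggressive
    }
}
```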

Failure Scenarios

1. Network Partition:

Consumer loses connection to broker
→ Heartbeat timeout → session timeout → rebalance
→ When network recovers → consumer reconnect → join group

2. GC Pause:

Long GC pause → consumer sends no heartbeats
→ Broker considers consumer dead → rebalance
→ After GC, consumer learns about the rebalance

Example: session.timeout.ms=30000, GC pause = 35 seconds. The consumer sends no
heartbeat for 35 seconds; the broker waits 30 seconds, then excludes it and triggers a
rebalance. Once the GC pause ends, the consumer rejoins the group and gets new partitions.

3. Slow Processing:

Processing batch > max.poll.interval.ms
→ Broker considers consumer dead → rebalance
→ Consumer continues → receives new assignment

Monitoring Rebalancing

Key metrics:

kafka.consumer:type=consumer-coordinator-metrics
  rebalance-total, rebalance-rate-per-hour
  last-rebalance-seconds-ago
  assigned-partitions
  commit-latency-avg

Alerts:

- Rebalance more often than every 10 minutes → warning
- Rebalance more often than every 5 minutes → critical
- Consumer hasn't sent heartbeat > 30s → critical
- Consumer lag growing → warning
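A sketch of how the frequency thresholds above might be evaluated; the input is "seconds since the last rebalance" (e.g. from last-rebalance-seconds-ago), however your monitoring stack obtains it:

```java
public class RebalanceAlerts {
    enum Severity { OK, WARNING, CRITICAL }

    // Thresholds from the alert list above:
    // more often than every 5 min → critical, every 10 min → warning
    static Severity rebalanceAlert(long secondsSinceLastRebalance) {
        if (secondsSinceLastRebalance < 5 * 60) return Severity.CRITICAL;
        if (secondsSinceLastRebalance < 10 * 60) return Severity.WARNING;
        return Severity.OK;
    }

    public static void main(String[] args) {
        System.out.println(rebalanceAlert(120));  // CRITICAL
        System.out.println(rebalanceAlert(420));  // WARNING
        System.out.println(rebalanceAlert(3600)); // OK
    }
}
```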

Troubleshooting

1. Identify the cause:

// Logs show:
"Heartbeat session expired"  → session.timeout.ms too short
"Max poll interval exceeded" → processing too slow
"Member is dead"             → consumer crashed

2. Resolving issues:

Heartbeat issues:
  → Increase session.timeout.ms
  → Check network connectivity

Poll interval issues:
  → Reduce max.poll.records
  → Optimize processing
  → Increase max.poll.interval.ms

Best Practices

✅ Cooperative rebalancing
✅ Sufficient session timeout (30s+)
✅ Fast message processing
✅ ConsumerRebalanceListener for graceful handling
✅ Static membership for stable instances
✅ Monitor rebalance frequency

❌ Frequent rebalances without investigation
❌ Eager rebalancing for production
❌ Too short timeout
❌ Without onPartitionsRevoked handling
❌ Ignoring consumer lag

Architectural decisions

  1. Cooperative rebalancing — standard for production
  2. Static membership — for stable instances
  3. Proper tuning — balance between detection time and false positives
  4. Monitoring — early problem detection
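Static membership (decision 2 above) is a one-line consumer setting; a sketch, where the instance id (here a hypothetical orders-consumer-1) must be stable and unique per instance:

```java
import java.util.Properties;

public class StaticMembership {
    static Properties staticMemberProps(String instanceId) {
        Properties props = new Properties();
        props.put("group.id", "my-group");
        // Stable id per instance: on a restart that finishes within
        // session.timeout.ms the consumer reclaims its old partitions
        // without triggering a rebalance
        props.put("group.instance.id", instanceId);
        props.put("session.timeout.ms", "60000"); // give restarts time to complete
        return props;
    }

    public static void main(String[] args) {
        System.out.println(staticMemberProps("orders-consumer-1"));
    }
}
```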

Summary for Senior

  • Rebalancing — critical process for group availability
  • Cooperative rebalancing minimizes disruption
  • Group coordinator protocol: join → sync → heartbeat
  • Parameter tuning is critical for stability
  • Monitoring rebalance frequency — early problem indicator

🎯 Interview Cheat Sheet

Must know:

  • Rebalancing — redistribution of partitions among consumers in a group
  • Triggers: new consumer, crash, session timeout, adding partitions
  • Eager rebalancing: full stop 5-30 seconds; Cooperative: 1-5 seconds, processing continues
  • ConsumerRebalanceListener: commitSync in onPartitionsRevoked, cleanup state
  • Max poll interval exceeded — common cause: consumer doesn’t call poll() in time
  • Group Coordinator protocol: JoinGroup → Assignment (Leader) → SyncGroup → Heartbeat
  • Static membership (group.instance.id) — avoids rebalance on rolling restarts that finish within session.timeout.ms

Common follow-up questions:

  • What is eager vs cooperative rebalancing? — Eager: all stop. Cooperative: only moved partitions.
  • What happens with too short session.timeout? — False positives, unnecessary rebalances.
  • How to avoid rebalance on GC pause? — Increase session.timeout.ms, static membership.
  • Who computes the assignment? — The leader consumer (classic protocol); with the next-gen protocol (KIP-848, default in Kafka 4.0) the Group Coordinator computes it server-side.

Red flags (DO NOT say):

  • “Rebalancing is fast and painless” — 5-30 seconds downtime (eager)
  • “You can ignore frequent rebalances” — sign of timeout or processing problems
  • “Eager rebalancing is best for production” — cooperative minimizes disruption
  • “Rebalancing doesn’t affect processing” — full stop with eager

Related topics:

  • [[5. What is Consumer Group]]
  • [[6. How does consumer balancing work in a group]]
  • [[8. What happens when you add a new consumer to the group]]
  • [[14. What is the difference between auto commit and manual commit]]