Question 15 · Section 15

What is rebalancing and when does it happen?

Junior Level

Definition

Rebalancing is the process of redistributing partitions among consumers in a group.

The “cost”: during a rebalance, consumers do NOT read messages. With eager rebalancing — a full stop for 5-30 seconds. During this time lag grows, and a cascading failure makes it grow avalanche-style.

Before rebalance:
  C1 → P0, P1
  C2 → P2, P3

After adding C3:
  C1 → P0, P1
  C2 → P2
  C3 → P3

When does rebalancing occur?

1. A new consumer is added
2. A consumer crashed or disconnected
3. New partitions were added to a topic
4. A consumer changed its topic subscription

What happens during rebalance?

1. All consumers stop reading
2. Kafka redistributes partitions
3. Each consumer gets new partitions
4. Reading resumes

Important: With eager rebalancing — full stop. With cooperative (Kafka 2.4+) — processing continues on remaining partitions, only moved partitions experience downtime.

Code example

// Consumer automatically participates in rebalance
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("orders"));
// When another consumer joins with the same group.id,
// an automatic rebalance occurs

Middle Level

Rebalancing process (step by step)

1. Trigger — event (new consumer, timeout, etc.)
2. Group coordinator starts rebalance
3. All consumers revoke their partitions
4. New assignment is computed
5. Consumers receive new partitions
6. Reading resumes

Types of rebalance

Eager Rebalancing (old):

All consumers stop completely
Full partition redistribution
All consumers resume work
→ Full group downtime

Cooperative Rebalancing (Kafka 2.4+):

Gradual partition movement
Consumers continue working with remaining partitions
→ Minimal downtime
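The difference in cost can be illustrated with a toy calculation (plain Java, not Kafka's actual assignor code), using the junior-level example of C3 joining a group that reads P0-P3:

```java
import java.util.List;
import java.util.Map;

public class RebalanceCost {
    // Cooperative: only partitions whose owner changes are revoked
    static long cooperativeRevoked(Map<String, List<Integer>> before,
                                   Map<String, List<Integer>> after) {
        return before.entrySet().stream()
            .flatMap(e -> e.getValue().stream()
                .filter(p -> !after.getOrDefault(e.getKey(), List.of()).contains(p)))
            .count();
    }

    public static void main(String[] args) {
        // Assignments from the example above (sticky: C3 takes only P3)
        Map<String, List<Integer>> before =
            Map.of("C1", List.of(0, 1), "C2", List.of(2, 3));
        Map<String, List<Integer>> after =
            Map.of("C1", List.of(0, 1), "C2", List.of(2), "C3", List.of(3));

        // Eager: every consumer revokes everything before reassignment
        int eager = before.values().stream().mapToInt(List::size).sum();
        System.out.println("Eager revokes: " + eager);                                   // 4 — all pause
        System.out.println("Cooperative revokes: " + cooperativeRevoked(before, after)); // 1 — only P3 pauses
    }
}
```

With eager, all 4 partitions pause even though 3 of them end up with the same owner; cooperative pauses only the single partition that actually moves.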

ConsumerRebalanceListener

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit offsets before losing partitions
        consumer.commitSync();
        // Clean up local resources
        cleanup(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Initialize new partitions
        initialize(partitions);
    }
});

Max Poll Interval Exceeded

// If batch processing takes too long
props.put("max.poll.interval.ms", "300000");  // 5 minutes

// If consumer doesn't call poll() within this time
// → considered dead → excluded from group → rebalance
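If processing genuinely takes long, a common mitigation (a sketch — the exact values are assumptions and depend on your workload) is to shrink the batch rather than only raising the interval:

```java
import java.util.Properties;

public class PollTuning {
    static Properties pollTuning() {
        Properties props = new Properties();
        // Fewer records per poll() → each processing cycle finishes
        // well within max.poll.interval.ms
        props.put("max.poll.records", "100");        // default: 500
        props.put("max.poll.interval.ms", "300000"); // 5 minutes, as above
        return props;
    }

    public static void main(String[] args) {
        System.out.println(pollTuning());
    }
}
```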

Common mistakes

  1. Frequent rebalances:
    Max poll interval exceeded → consumer excluded → rebalance
    → Downtime → performance loss
    
  2. Without ConsumerRebalanceListener:
    Data loss on rebalance (offsets not committed)
    
  3. Too short timeout:
    session.timeout.ms=5000 → false positives
    → Consumer considered dead → unnecessary rebalance
    

Senior Level

Internal Implementation

Group Coordinator:

Coordinator broker stores group state:
- Member list (list of consumers)
- Partition assignments
- Committed offsets (in __consumer_offsets)

Rebalancing protocol:

Glossary:
  Group Coordinator — broker managing group membership.
  JoinGroup — consumer says "I want to join the group".
  SyncGroup — leader distributes the assignment result.

1. Join Phase:
   - Consumer → JoinGroup request
   - Coordinator → collects all members
   - Coordinator → chooses group leader

2. Assignment Phase:
   - Leader → computes assignment
   - Leader → SyncGroup request
   - Coordinator → distributes assignments

3. Heartbeat Phase:
   - Consumers → Heartbeat requests
   - Coordinator → confirms alive
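As an illustration only — a toy model in plain Java, not the real group protocol implementation — one round of the protocol can be sketched as:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupProtocolSketch {
    // Toy assignment the leader might compute in the assignment phase
    // (simple round-robin; real assignors are range/sticky/cooperative-sticky)
    static Map<String, List<Integer>> computeAssignment(List<String> members, int partitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        members.forEach(m -> assignment.put(m, new ArrayList<>()));
        for (int p = 0; p < partitions; p++)
            assignment.get(members.get(p % members.size())).add(p);
        return assignment;
    }

    public static void main(String[] args) {
        // 1. Join phase: consumers send JoinGroup; coordinator picks a leader
        List<String> members = List.of("C1", "C2", "C3");
        String leader = members.get(0);

        // 2. Assignment phase: leader computes the assignment and sends it in
        //    its SyncGroup request; coordinator forwards each member its share
        Map<String, List<Integer>> assignment = computeAssignment(members, 4);

        // 3. Heartbeat phase: members heartbeat until the next trigger
        System.out.println("Leader: " + leader + ", assignment: " + assignment);
    }
}
```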

Cooperative Rebalancing — in detail

Incremental Cooperative Rebalancing:

Instead of full stop:
1. Revoke some partitions
2. Assign new partitions
3. Continue working
4. Repeat if needed

Advantages:
- Minimal downtime
- Preserve data locality
- Better availability

Configuration:

props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

Tuning Parameters

# Session management
session.timeout.ms: 30000           # 30 seconds until detection
heartbeat.interval.ms: 10000        # 10 seconds (1/3 of session timeout)

# Poll management
max.poll.interval.ms: 300000        # 5 minutes for processing
max.poll.records: 500               # batch for processing

# Assignment strategy
partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
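The 1/3 rule of thumb above (heartbeat.interval.ms at most a third of session.timeout.ms, so several heartbeats can be missed before the session expires) can be sanity-checked in code — a sketch:

```java
public class TimeoutCheck {
    // Rule of thumb: heartbeat.interval.ms <= session.timeout.ms / 3,
    // so a consumer can miss a couple of heartbeats without being evicted
    static boolean heartbeatOk(int sessionTimeoutMs, int heartbeatIntervalMs) {
        return heartbeatIntervalMs <= sessionTimeoutMs / 3;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatOk(30000, 10000)); // true — matches the config above
        System.out.println(heartbeatOk(5000, 3000));   // false — too aggressive
    }
}
```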

Failure Scenarios

1. Network Partition:

Consumer loses connection to broker
→ Heartbeat timeout → session timeout → rebalance
→ When network recovers → consumer reconnect → join group

2. GC Pause:

Long GC pause → consumer sends no heartbeats
→ Broker considers consumer dead → rebalance
→ After GC, consumer learns about the rebalance

Example: session.timeout.ms=30000, GC pause = 35 seconds. The consumer sends no
heartbeat for 35 seconds; the broker waits 30 seconds, then excludes it and triggers a
rebalance. Once the GC pause ends, the consumer rejoins the group and gets new partitions.

3. Slow Processing:

Processing batch > max.poll.interval.ms
→ Broker considers consumer dead → rebalance
→ Consumer continues → receives new assignment

Monitoring Rebalancing

Key metrics:

kafka.consumer:type=consumer-coordinator-metrics
  rebalance-total, rebalance-rate-per-hour
  last-rebalance-seconds-ago
  assigned-partitions
  commit-latency-avg

Alerts:

- Rebalance more often than every 10 minutes → warning
- Rebalance more often than every 5 minutes → critical
- Consumer hasn't sent heartbeat > 30s → critical
- Consumer lag growing → warning
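A sketch of how the frequency thresholds above might be evaluated; the input is "seconds since the last rebalance" (e.g. from last-rebalance-seconds-ago), however your monitoring stack obtains it:

```java
public class RebalanceAlerts {
    enum Severity { OK, WARNING, CRITICAL }

    // Thresholds from the alert list above:
    // more often than every 5 min → critical, every 10 min → warning
    static Severity rebalanceAlert(long secondsSinceLastRebalance) {
        if (secondsSinceLastRebalance < 5 * 60) return Severity.CRITICAL;
        if (secondsSinceLastRebalance < 10 * 60) return Severity.WARNING;
        return Severity.OK;
    }

    public static void main(String[] args) {
        System.out.println(rebalanceAlert(120));  // CRITICAL
        System.out.println(rebalanceAlert(420));  // WARNING
        System.out.println(rebalanceAlert(3600)); // OK
    }
}
```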

Troubleshooting

1. Identify the cause:

// Logs show:
"Heartbeat session expired"  → session.timeout.ms too short
"Max poll interval exceeded" → processing too slow
"Member is dead"             → consumer crashed

2. Resolving issues:

Heartbeat issues:
  → Increase session.timeout.ms
  → Check network connectivity

Poll interval issues:
  → Reduce max.poll.records
  → Optimize processing
  → Increase max.poll.interval.ms

Best Practices

✅ Cooperative rebalancing
✅ Sufficient session timeout (30s+)
✅ Fast message processing
✅ ConsumerRebalanceListener for graceful handling
✅ Static membership for stable instances
✅ Monitor rebalance frequency

❌ Frequent rebalances without investigation
❌ Eager rebalancing for production
❌ Too short timeout
❌ Without onPartitionsRevoked handling
❌ Ignoring consumer lag

Architectural decisions

  1. Cooperative rebalancing — standard for production
  2. Static membership — for stable instances
  3. Proper tuning — balance between detection time and false positives
  4. Monitoring — early problem detection
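Static membership (decision 2 above) is a one-line consumer setting; a sketch, where the instance id (here a hypothetical orders-consumer-1) must be stable and unique per instance:

```java
import java.util.Properties;

public class StaticMembership {
    static Properties staticMemberProps(String instanceId) {
        Properties props = new Properties();
        props.put("group.id", "my-group");
        // Stable id per instance: on a restart that finishes within
        // session.timeout.ms the consumer reclaims its old partitions
        // without triggering a rebalance
        props.put("group.instance.id", instanceId);
        props.put("session.timeout.ms", "60000"); // give restarts time to complete
        return props;
    }

    public static void main(String[] args) {
        System.out.println(staticMemberProps("orders-consumer-1"));
    }
}
```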

Summary for Senior

  • Rebalancing — critical process for group availability
  • Cooperative rebalancing minimizes disruption
  • Group coordinator protocol: join → sync → heartbeat
  • Parameter tuning is critical for stability
  • Monitoring rebalance frequency — early problem indicator

🎯 Interview Cheat Sheet

Must know:

  • Rebalancing — redistribution of partitions among consumers in a group
  • Triggers: new consumer, crash, session timeout, adding partitions
  • Eager rebalancing: full stop 5-30 seconds; Cooperative: 1-5 seconds, processing continues
  • ConsumerRebalanceListener: commitSync in onPartitionsRevoked, cleanup state
  • Max poll interval exceeded — common cause: consumer doesn’t call poll() in time
  • Group Coordinator protocol: JoinGroup → Assignment (Leader) → SyncGroup → Heartbeat
  • Static membership (group.instance.id) — avoids rebalance on rolling restarts that finish within session.timeout.ms

Common follow-up questions:

  • What is eager vs cooperative rebalancing? — Eager: all stop. Cooperative: only moved partitions.
  • What happens with too short session.timeout? — False positives, unnecessary rebalances.
  • How to avoid rebalance on GC pause? — Increase session.timeout.ms, static membership.
  • Who computes the assignment? — The leader consumer (classic protocol); with the next-gen protocol (KIP-848, default in Kafka 4.0) the Group Coordinator computes it server-side.

Red flags (DO NOT say):

  • “Rebalancing is fast and painless” — 5-30 seconds downtime (eager)
  • “You can ignore frequent rebalances” — sign of timeout or processing problems
  • “Eager rebalancing is best for production” — cooperative minimizes disruption
  • “Rebalancing doesn’t affect processing” — full stop with eager

Related topics:

  • [[5. What is Consumer Group]]
  • [[6. How does consumer balancing work in a group]]
  • [[8. What happens when you add a new consumer to the group]]
  • [[14. What is the difference between auto commit and manual commit]]