Question 8 · Section 15

What happens when you add a new consumer to the group?

When you add a new consumer, a rebalance occurs — partitions are redistributed among all consumers.


Junior Level

Short answer

When you add a new consumer, a rebalance occurs — partitions are redistributed among all consumers.

Example

Before adding:
  C1 → P0, P1
  C2 → P2

Added C3:
  C1 → P0
  C2 → P1
  C3 → P2
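The before/after picture above can be reproduced with a small plain-Java sketch of range-style assignment (the class and method names are ours, no Kafka dependency; Kafka's real RangeAssignor works per topic but follows the same arithmetic):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of range-style partition assignment.
public class AssignDemo {
    static Map<String, List<Integer>> rangeAssign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int n = consumers.size();
        int base = partitionCount / n;   // every consumer gets at least this many
        int extra = partitionCount % n;  // the first 'extra' consumers get one more
        int p = 0;
        for (int i = 0; i < n; i++) {
            List<Integer> owned = new ArrayList<>();
            int count = base + (i < extra ? 1 : 0);
            for (int j = 0; j < count; j++) owned.add(p++);
            out.put(consumers.get(i), owned);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rangeAssign(List.of("C1", "C2"), 3));       // {C1=[0, 1], C2=[2]}
        System.out.println(rangeAssign(List.of("C1", "C2", "C3"), 3)); // {C1=[0], C2=[1], C3=[2]}
    }
}
```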

What happens during rebalance?

- Eager rebalancing: 5-30 seconds of full downtime.
- Cooperative rebalancing: 1-5 seconds; processing continues on the partitions that are not moving.
- Messages accumulate in Kafka while consumption is paused.
- After the rebalance, reading resumes from the last committed offsets.

How to add a consumer?

// Just launch a new consumer with the same group.id
props.put("group.id", "my-group");
consumer.subscribe(List.of("orders"));
// Kafka will automatically trigger a rebalance

Middle Level

Rebalancing process (step by step)

1. New consumer connects to broker
2. Sends JoinGroup request
3. Group coordinator starts rebalance
4. All consumers receive notification
5. All consumers stop reading (eager)
   or continue with some partitions (cooperative)
6. New assignment is computed
7. Assignment is distributed to consumers
8. Consumers start reading new partitions

Rebalancing Strategies

1. Eager (old approach):

All consumers stop completely
Full redistribution
All resume work
→ Full group downtime

2. Cooperative (Kafka 2.3+):

Gradual partition movement
Consumers continue working with remaining partitions
→ Minimal downtime
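A minimal plain-Java sketch of why cooperative rebalancing is cheaper: with a sticky assignor only partitions whose owner changes are revoked, while eager revokes everything (class and method names are ours, no Kafka dependency):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the cooperative idea: only partitions whose owner changes
// are revoked; everything else keeps being consumed during the rebalance.
public class CooperativeDiff {
    static Set<Integer> revoked(Map<String, Set<Integer>> oldA, Map<String, Set<Integer>> newA) {
        Set<Integer> moved = new TreeSet<>();
        for (var e : oldA.entrySet()) {
            for (int p : e.getValue()) {
                // a partition moves if its old owner no longer has it
                if (!newA.getOrDefault(e.getKey(), Set.of()).contains(p)) moved.add(p);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        var oldA = Map.of("C1", Set.of(0, 1), "C2", Set.of(2));
        var newA = Map.of("C1", Set.of(0), "C2", Set.of(2), "C3", Set.of(1));
        // Cooperative: only P1 is revoked. Eager would revoke all three.
        System.out.println(revoked(oldA, newA)); // [1]
    }
}
```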

ConsumerRebalanceListener

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit offsets before losing partitions
        consumer.commitSync();
        // Clean up local state
        cleanupState(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Initialize new partitions
        initializeState(partitions);
        // Can start reading from a specific offset
        for (TopicPartition tp : partitions) {
            consumer.seek(tp, getSavedOffset(tp));
        }
    }
});

Frequent rebalances — a problem

Unstable consumers → constant rebalances → system downtime

Causes:
- session.timeout.ms too short
- max.poll.interval.ms exceeded (long processing)
- Network issues
- GC pauses

Symptoms:
- Log messages: "Rebalance triggered"
- Consumer lag growth
- Throughput drop
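The causes above map to a handful of consumer settings. A hypothetical starting point as plain java.util.Properties (the values are illustrative, tune them for your workload):

```java
import java.util.Properties;

// Illustrative anti-rebalance-storm settings (starting points, not universal).
public class StableConsumerProps {
    static Properties stableProps() {
        Properties props = new Properties();
        props.put("session.timeout.ms", "30000");    // tolerate short GC pauses and network blips
        props.put("heartbeat.interval.ms", "10000"); // roughly 1/3 of session.timeout.ms
        props.put("max.poll.interval.ms", "300000"); // allow up to 5 minutes of processing per poll
        props.put("max.poll.records", "500");        // smaller batches finish within the interval
        return props;
    }
}
```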

Common mistakes

  1. Without ConsumerRebalanceListener:
    Data loss on rebalance (offsets not committed)
    
  2. Frequent rebalances:
    Consumers crash → rebalance → crash again → cycle
    
  3. Long processing between poll:
    max.poll.interval.ms exceeded → consumer excluded → rebalance
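Mistake 3 can often be avoided by sizing the batch to the poll deadline. A back-of-the-envelope helper (our own, not a Kafka API):

```java
// Rough sizing: the largest max.poll.records that still finishes
// processing well before max.poll.interval.ms expires.
public class PollBudget {
    static int safeMaxPollRecords(long maxPollIntervalMs, long perRecordMs, int safetyFactor) {
        return (int) (maxPollIntervalMs / (perRecordMs * safetyFactor));
    }

    public static void main(String[] args) {
        // 300 000 ms budget, ~200 ms per record, 2x safety margin
        System.out.println(safeMaxPollRecords(300_000, 200, 2)); // 750
    }
}
```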
    

Senior Level

Cooperative Rebalancing — in detail

Incremental Cooperative Rebalancing (Kafka 2.3+):

Eager:
  Revoke all partitions → Assign new → Resume
  → Full downtime

Cooperative:
  Step-by-step movement:
  1. Revoke some partitions
  2. Assign new partitions
  3. Continue working
  4. Repeat if needed
  → Minimal downtime

Configuration:

props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

Group Coordinator Protocol

Rebalancing protocol:

JoinGroup — consumer says "I want to join the group".
SyncGroup — leader distributes the assignment result.

1. Join Phase:
   - Consumer → JoinGroup request
   - Coordinator → collects all members
   - Coordinator → chooses leader

2. Assignment Phase:
   - Leader → computes assignment
   - Leader → SyncGroup request with assignment
   - Coordinator → distributes assignments

3. Steady State:
   - Consumers → Heartbeat requests
   - Coordinator → confirms alive
   - Consumers → process messages

Production Configuration

partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
session.timeout.ms: 30000           # 30 seconds
heartbeat.interval.ms: 10000        # 10 seconds (~1/3 of session.timeout.ms)
max.poll.interval.ms: 300000        # 5 minutes
max.poll.records: 500               # records per poll() batch
enable.auto.commit: false           # commit manually after processing

Static Membership

// Permanent consumer ID
props.put("group.instance.id", "consumer-1");

// On restart:
// - Doesn't trigger rebalance
// - Returns the same partitions
// - Waits session.timeout.ms before rebalance

Monitoring Rebalancing

Key metrics:

kafka.consumer:type=consumer-coordinator-metrics
  rebalance-total              (total number of rebalances)
  last-rebalance-seconds-ago   (time since the last rebalance)
  assigned-partitions          (partitions currently owned)
  commit-latency-avg           (average offset-commit latency)

Alerts:

- Rebalance more often than every 10 minutes → warning
- Rebalance more often than every 5 minutes → critical
- Consumer hasn't sent heartbeat > 30s → critical
- Consumer lag growing → warning
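The rebalance-frequency thresholds above can be expressed as a tiny helper for an alerting pipeline (the class and method names are ours):

```java
// Maps time since the last rebalance to the alert levels listed above.
public class RebalanceAlerts {
    static String severity(long minutesSinceLastRebalance) {
        if (minutesSinceLastRebalance < 5) return "critical"; // more often than every 5 minutes
        if (minutesSinceLastRebalance < 10) return "warning"; // more often than every 10 minutes
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(severity(3));  // critical
        System.out.println(severity(7));  // warning
        System.out.println(severity(60)); // ok
    }
}
```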

Troubleshooting Rebalancing

1. Identify the cause:

// Logs show:
"Heartbeat session expired"  session.timeout.ms too short
"Max poll interval exceeded"  processing too slow
"Member is dead"  consumer crashed

2. Resolving issues:

Heartbeat issues:
  Increase session.timeout.ms
  Increase heartbeat.interval.ms

Poll interval issues:
  Reduce max.poll.records
  Optimize message processing
  Increase max.poll.interval.ms

Network issues:
  Check connectivity to brokers
  Check firewall rules

Advanced Patterns

1. Gradual Rollout:

Add consumers gradually:
1. Launch one consumer
2. Wait for stabilization
3. Check lag and throughput
4. Add the next one

2. Partition-aware State Management:

public class StatefulConsumer implements ConsumerRebalanceListener {
    private final Map<TopicPartition, State> states = new HashMap<>();

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Save and drop local state for each partition we are losing
        for (TopicPartition tp : partitions) {
            states.get(tp).save();
            states.remove(tp);
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Load state for each newly assigned partition
        for (TopicPartition tp : partitions) {
            states.put(tp, State.load(tp));
        }
    }
}

Best Practices

✅ Cooperative-sticky assignor
✅ Sufficient session timeout (30s+)
✅ Fast message processing
✅ ConsumerRebalanceListener for graceful handling
✅ Static membership for stable instances
✅ Monitor rebalance frequency

❌ Eager assignor in production with rolling deploys (acceptable only for fully static clusters; otherwise use CooperativeSticky)
❌ Too short session.timeout.ms
❌ Skipping onPartitionsRevoked handling
❌ Ignoring consumer lag

Architectural decisions

  1. Cooperative rebalancing — standard for production
  2. Static membership — for stable instances
  3. State management — save state on rebalance
  4. Monitoring — early problem detection

Summary for Senior

  • Rebalancing — critical process for group availability
  • Cooperative rebalancing minimizes disruption
  • ConsumerRebalanceListener is mandatory for production
  • Static membership reduces unnecessary rebalances
  • Monitoring rebalance frequency — early problem indicator
  • Troubleshooting requires analysis of logs and metrics

🎯 Interview Cheat Sheet

Must know:

  • Adding a new consumer triggers a rebalance — partitions are redistributed
  • Eager rebalancing: 5-30 seconds full downtime; Cooperative: 1-5 seconds
  • ConsumerRebalanceListener is mandatory: commitSync in onPartitionsRevoked
  • Frequent rebalances — a problem: session.timeout exceeded, max.poll.interval exceeded
  • CooperativeStickyAssignor — standard for production (Kafka 2.3+)
  • Static membership (group.instance.id) avoids a rebalance on rolling deploys, provided each instance restarts within session.timeout.ms
  • Gradual rollout: add consumers gradually, check lag

Common follow-up questions:

  • What happens if a consumer processes for too long? — max.poll.interval exceeded → exclusion → rebalance.
  • How to speed up rebalance? — CooperativeStickyAssignor + static membership.
  • What to do with stateful consumer on rebalance? — Save state in onPartitionsRevoked, load in onPartitionsAssigned.
  • Can you disable rebalance? — No, but static membership minimizes it for known instances.

Red flags (DO NOT say):

  • “Rebalancing happens instantly” — 5-30 seconds eager, 1-5 cooperative
  • “You can ignore onPartitionsRevoked” — loss of offsets and state
  • “Frequent rebalances are normal” — sign of timeout or processing problems
  • “Eager rebalancing is the best choice” — cooperative minimizes disruption

Related topics:

  • [[15. What is rebalancing and when does it happen]]
  • [[6. How does consumer balancing work in a group]]
  • [[5. What is Consumer Group]]
  • [[14. What is the difference between auto commit and manual commit]]