What is replication in Kafka
Junior Level
Definition
Replication is the copying of partition data across multiple brokers for fault tolerance.
Partition 0:
Broker A → Leader (accepts writes)
Broker B → Follower (copy)
Broker C → Follower (copy)
Why is replication needed?
If the leader goes down:
One of the followers becomes the new leader
Data is not lost (assuming unclean.leader.election.enable=false and in-sync replicas exist).
Consumers continue reading
Replication Factor
replication-factor=3 → 3 copies on different brokers
replication-factor=1 → no fault tolerance
Example: Creating a Topic
kafka-topics.sh --create \
--topic orders \
--partitions 3 \
--replication-factor 3 \
--bootstrap-server localhost:9092
When a High Replication Factor is NOT Needed
- Dev/test environments — RF=1 is acceptable
- High-volume metrics/logs where data loss is tolerable — RF=1 or 2
- Resource constraints — each additional replica multiplies disk usage (RF=N → N× the storage)
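Replication can be verified after creation with the describe command, which prints the leader, replica set, and ISR for every partition (a sketch; topic name and broker address match the creation example above, and broker IDs in the sample output are illustrative):

```shell
# Show leader, replica assignment, and ISR per partition
kafka-topics.sh --describe \
  --topic orders \
  --bootstrap-server localhost:9092

# Sample per-partition output line (IDs are illustrative):
# Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
```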
Middle Level
Leader and Followers
Partition 0:
Broker 1 → Leader (read/write)
Broker 2 → Follower (read-only, replicates from leader)
Broker 3 → Follower (read-only, replicates from leader)
All writes go through the leader
Followers passively copy data
ISR (In-Sync Replicas)
ISR — a list of replicas that are synchronized with the leader
Example:
Leader: Broker 1
ISR: [Broker 1, Broker 2, Broker 3]
If Broker 3 falls behind:
ISR: [Broker 1, Broker 2]
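When a replica drops out of the ISR, the partition becomes under-replicated. The CLI can list such partitions directly (a sketch; assumes a local broker):

```shell
# List partitions whose ISR is smaller than the full replica set
kafka-topics.sh --describe \
  --under-replicated-partitions \
  --bootstrap-server localhost:9092
```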
min.insync.replicas
Minimum number of replicas in ISR required for a write
min.insync.replicas=2:
acks=all succeeds only if at least 2 replicas are in the ISR
If ISR < 2 → the write is rejected (the producer gets a NotEnoughReplicas error)
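min.insync.replicas can be set per topic on a live cluster with kafka-configs.sh (a sketch; topic name and broker address are from the earlier example):

```shell
# Raise the write quorum for an existing topic
kafka-configs.sh --alter \
  --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2 \
  --bootstrap-server localhost:9092
```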
Failover
Leader goes down:
Controller broker selects a new leader from ISR
Automatic process
Consumers and producers switch over
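Failover itself is automatic, but after the failed broker recovers it rejoins only as a follower. Leadership can be moved back to the preferred replica manually (or automatically via auto.leader.rebalance.enable=true); a sketch:

```shell
# Re-elect the preferred (first-listed) replica as leader
kafka-leader-election.sh \
  --election-type PREFERRED \
  --topic orders --partition 0 \
  --bootstrap-server localhost:9092
```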
Common Mistakes
- Too small replication factor:
replication-factor=1 → no fault tolerance; if the broker goes down, data is lost
- min.insync.replicas=1 with acks=all:
acks=all waits only for the leader, so there is no protection against data loss; leader goes down → data is lost
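On the producer side, these mistakes are avoided with durability-oriented settings; a minimal sketch of a producer config:

```properties
# producer.properties — durability-oriented settings (a sketch)
acks=all                   # wait for all in-sync replicas
enable.idempotence=true    # no duplicates on retry
retries=2147483647         # retry transient failures indefinitely
```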
Senior Level
Leader Election
Controller Broker:
One of the brokers is the cluster controller
Responsible for:
- Selecting leaders for partitions
- Managing ISR
- Handling failover
If the controller goes down → a new controller is elected
Leader Election Process:
1. Leader goes down
2. Controller detects the failure
3. Selects a new leader from ISR
4. Updates metadata
5. Producers and consumers switch over
ISR Management
Expanding ISR:
Follower catches up with the leader → added to ISR
Shrinking ISR:
Follower falls behind → removed from ISR
Criterion: replica.lag.time.max.ms (30 seconds by default)
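On the broker side this criterion is a single setting in server.properties; a follower that has not caught up to the leader's log end within this window is dropped from the ISR:

```properties
# server.properties
replica.lag.time.max.ms=30000   # default: 30 seconds
```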
Cross-Datacenter Replication
MirrorMaker 2:
(available since Kafka 2.4, replaces the legacy MirrorMaker 1)
Replication between data centers
Active-passive or active-active topology
Asynchronous replication
Considerations:
- Network latency between DCs
- Bandwidth costs
- RPO (Recovery Point Objective)
- RTO (Recovery Time Objective)
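A minimal MirrorMaker 2 setup is a single properties file passed to connect-mirror-maker.sh (a sketch; cluster names and addresses below are illustrative):

```properties
# mm2.properties — active-passive replication sketch
clusters = primary, backup
primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# Replicate everything from primary to backup
primary->backup.enabled = true
primary->backup.topics = .*

# RF for the replicated topics on the target cluster
replication.factor = 3
```

Start it with `connect-mirror-maker.sh mm2.properties`.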
Production Configuration
# Topic level
replication-factor=3 (a CLI flag, set with --replication-factor at creation)
min.insync.replicas=2
# Broker level (server.properties)
unclean.leader.election.enable=false
replica.lag.time.max.ms=30000
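Applied at topic creation, those settings look like this (topic name and partition count are illustrative; unclean.leader.election.enable can also be set per topic):

```shell
kafka-topics.sh --create \
  --topic payments \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false \
  --bootstrap-server localhost:9092
```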
Best Practices
✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ unclean.leader.election.enable=false
✅ Even leader distribution
✅ Monitor ISR shrink/expansion
❌ RF=1 for production
❌ min.insync.replicas=1 with acks=all
❌ Unclean leader election
❌ Ignoring ISR shrink
❌ Single data center for critical data
Architectural Decisions
- RF=3 — production standard
- min.insync.replicas=2 — balance between availability and durability
- Cross-DC replication — for disaster recovery
- ISR monitoring — early indicator of problems
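For the last point, the CLI exposes ISR health directly (JMX metrics such as UnderReplicatedPartitions and IsrShrinksPerSec cover the same ground for dashboards); a sketch:

```shell
# Partitions whose ISR has dropped below min.insync.replicas —
# acks=all writes to these partitions are already being rejected
kafka-topics.sh --describe \
  --under-min-isr-partitions \
  --bootstrap-server localhost:9092
```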
Summary for Senior
- Replication is the foundation of Kafka fault tolerance
- ISR is a critical mechanism for consistency
- Leader election from the ISR guarantees no loss of acknowledged writes (with acks=all)
- Cross-DC replication for disaster recovery
- ISR health monitoring is mandatory for production
🎯 Interview Cheat Sheet
Must know:
- Replication = copying partition data to multiple brokers for fault tolerance
- Replication Factor (RF): RF=3 is the production standard; data survives the loss of up to 2 brokers (with min.insync.replicas=2, writes continue after 1 failure)
- Leader accepts read/write, Followers passively copy data
- ISR (In-Sync Replicas) — replicas synchronized with the leader
- min.insync.replicas=2 at RF=3: write is rejected if ISR < 2
- unclean.leader.election.enable=false — leader only from ISR, otherwise data loss
- When the leader goes down: Controller selects a new leader from ISR
Common follow-up questions:
- What happens with RF=1? — No fault tolerance; if the broker goes down, data is lost.
- Why min.insync.replicas=2? — Protection against writing to a single copy (leader down = data loss).
- What is unclean leader election? — Leader from a non-ISR follower → data loss.
- How does cross-DC replication work? — MirrorMaker 2: asynchronous replication between data centers.
Red flags (DO NOT say):
- “RF=1 is enough for production” — no fault tolerance
- “All replicas accept writes” — only the Leader; Followers are read-only
- “min.insync.replicas=1 with acks=all” — no protection against data loss
- “Unclean leader election is safe” — data is lost
Related topics:
- [[17. What are leader and follower replicas]]
- [[18. What is ISR (In-Sync Replicas)]]
- [[19. How does Kafka ensure fault tolerance]]
- [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]