How does Kafka ensure fault tolerance?
Junior Level
Key Mechanisms
Kafka ensures fault tolerance through replication:
1. Replication — data on multiple brokers
2. Leader election — automatic selection of a new leader
3. Consumer group — consumers restart and continue
Failover Example
Broker A (Leader) → goes down
Broker B (Follower) → becomes Leader
Consumers → continue reading from the new leader
Producers → continue writing to the new leader
Replication Factor
replication-factor=3 → tolerates failure of 2 brokers (assuming the replicas are spread across different brokers, which Kafka does by default when enough brokers are available)
replication-factor=2 → tolerates failure of 1 broker
replication-factor=1 → no fault tolerance
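The replication-factor rules above can be sketched as simple arithmetic. This is a toy helper, not Kafka code; the function name and return keys are made up for illustration:

```python
# Toy helper: how many broker failures a topic survives before data is
# lost (durability) vs. before writes stop (availability, which also
# depends on min.insync.replicas).
def tolerated_failures(replication_factor: int, min_insync: int = 1):
    return {
        # data survives as long as at least one replica remains
        "before_data_loss": replication_factor - 1,
        # acks=all writes succeed while ISR size >= min.insync.replicas
        "before_writes_blocked": replication_factor - min_insync,
    }

print(tolerated_failures(3))                 # RF=3: survives 2 failures
print(tolerated_failures(3, min_insync=2))   # but writes stop after 1 failure
print(tolerated_failures(1))                 # RF=1: no fault tolerance
```

Note the distinction: RF=3 protects the data through 2 failures, but with min.insync.replicas=2 writes are already rejected after the second broker goes down.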
Creating a Fault-Tolerant Topic
kafka-topics.sh --create \
--topic orders \
--partitions 3 \
--replication-factor 3 \
--bootstrap-server localhost:9092
Middle Level
Fault Tolerance at Different Levels
Broker:
RF=3 → data on 3 brokers
Tolerates failure of 2 brokers
Producer:
acks=all + retries → automatic retry
enable.idempotence → duplicate protection
Consumer:
Committed offsets → continues after restart
Consumer group → automatic rebalancing
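The enable.idempotence protection above can be modeled in a few lines. This is a toy model of the broker-side check, not real Kafka code: each producer gets a producer ID, each batch carries a sequence number, and a retried duplicate is dropped instead of appended twice:

```python
# Toy model of broker-side idempotence (illustrative class, not Kafka's).
class Partition:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, record):
        expected = self.last_seq.get(producer_id, -1) + 1
        if seq < expected:       # retry of an already-written batch: drop it
            return "duplicate"
        if seq > expected:       # gap: real Kafka raises OutOfOrderSequence
            return "out_of_order"
        self.log.append(record)
        self.last_seq[producer_id] = seq
        return "appended"

p = Partition()
p.append("pid-1", 0, "order-1")   # appended
p.append("pid-1", 1, "order-2")   # appended
p.append("pid-1", 1, "order-2")   # retry after a timeout: dropped
print(p.log)                      # ['order-1', 'order-2'] — no duplicate
```

This is why acks=all + retries + enable.idempotence together give the producer safe automatic failover: retries cannot create duplicates.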
Leader Election
1. Leader goes down
2. Controller detects the failure
3. Selects a new leader from ISR
4. Producers and consumers switch over
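The election steps above can be sketched as a toy controller function (simplified: real Kafka also honors the replica assignment order and preferred leaders):

```python
# Toy leader election: drop the failed broker from the ISR and promote a
# surviving ISR member; only fall back to a non-ISR replica if unclean
# leader election is explicitly enabled (data loss risk).
def elect_leader(isr, failed_broker, unclean=False, all_replicas=()):
    survivors = [b for b in isr if b != failed_broker]
    if survivors:
        return survivors[0]
    if unclean:  # last resort: a possibly stale replica becomes leader
        live = [b for b in all_replicas if b != failed_broker]
        return live[0] if live else None
    return None  # partition goes offline rather than risk losing data

print(elect_leader(["A", "B", "C"], failed_broker="A"))  # 'B'
print(elect_leader(["A"], failed_broker="A"))            # None: offline
```

The `None` branch is the safe default (unclean.leader.election.enable=false): the partition is unavailable until an ISR replica returns, but no acknowledged data is lost.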
min.insync.replicas
min.insync.replicas=2:
Writes are only allowed if >= 2 replicas are in ISR
Protects against writing to a single copy
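The gate is a single comparison; a toy check (not Kafka code) of how acks=all writes are accepted or rejected:

```python
# Toy min.insync.replicas gate for acks=all producers: a write is
# accepted only while the ISR is at least min.insync.replicas large.
def write_allowed(isr, min_insync=2):
    return len(isr) >= min_insync

print(write_allowed(["A", "B", "C"]))  # True: full ISR
print(write_allowed(["A", "B"]))       # True: ISR still has 2 replicas
print(write_allowed(["C"]))            # False: would write to a single copy
```

The rejected case is exactly the protection described above: rather than acknowledge a write held by one broker only, Kafka fails the produce request (NotEnoughReplicas) until the ISR recovers.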
Common Mistakes
- Insufficient replication:
RF=2, min.insync=1 → data loss on leader failure
- Unclean leader election:
unclean.leader.election.enable=true → risk of data loss
Senior Level
Cross-Datacenter Replication
MirrorMaker 2:
Replication between data centers
Active-passive or active-active topology
Asynchronous replication
Topology options:
1. Active-Passive:
DC1 (active) → DC2 (passive)
If DC1 goes down → DC2 becomes active
2. Active-Active:
DC1 ↔ DC2
Both DCs accept writes
More complex consistency
Failure Scenarios
1. Single Broker Failure:
RF=3, min.insync=2:
Broker A (Leader) goes down
ISR=[B,C] → new leader from B or C
Writes continue (ISR >= 2)
Data is not lost
2. Network Partition:
Brokers split into two groups:
Group 1: Broker A, B
Group 2: Broker C
Group 1: ISR=[A,B] → works
Group 2: ISR=[C] → writes are rejected (ISR < min.insync)
3. Datacenter Failure:
Active-Passive:
DC1 goes down → MirrorMaker stops
DC2 is activated → continues operation
RPO = data replicated before the failure
RTO = time to activate DC2
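Because MirrorMaker replication is asynchronous, RPO is simply the replication lag at the moment of failure. A toy calculation with illustrative numbers:

```python
# Toy RPO arithmetic for async cross-DC replication: anything produced
# in DC1 after the last offset copied to DC2 is lost on failover.
produced_offset = 1_000_000    # last offset written in DC1 before the failure
replicated_offset = 999_400    # last offset MirrorMaker had copied to DC2
rpo_messages = produced_offset - replicated_offset
print(rpo_messages)            # 600 messages at risk of loss
```

Keeping MirrorMaker lag low (monitored as replication lag per partition) directly shrinks RPO; RTO is governed by how fast clients can be redirected to DC2.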
Recovery Procedures
1. Broker Recovery:
Failed broker restarts:
→ Connects to the cluster
→ Starts replicating from the leader
→ When it catches up → added back to ISR
→ Can become leader on next failover
2. Datacenter Recovery:
DC1 is restored:
→ MirrorMaker resumes replication
→ Syncs data between DCs
→ Switch back or continue in DC2
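The broker-recovery catch-up in step 1 can be sketched as a toy fetch loop (illustrative offsets and batch size, not Kafka internals):

```python
# Toy model of a restarted follower catching up: it fetches from the
# leader in batches and is re-added to the ISR once fully caught up.
def caught_up(leader_end_offset, follower_offset):
    return follower_offset >= leader_end_offset

leader_end = 5_000
follower = 4_200                                 # restart point from local log
while not caught_up(leader_end, follower):
    follower = min(follower + 300, leader_end)   # fetch one batch of records

print(caught_up(leader_end, follower))           # True: back in the ISR
```

In real Kafka the "caught up" test is time-based (the follower must not have lagged for more than replica.lag.time.max.ms), but the effect is the same: only then is it eligible to lead again.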
Production Configuration
# Topic level
replication-factor: 3
min.insync.replicas: 2
unclean.leader.election.enable: false
# Broker level
replica.lag.time.max.ms: 30000
auto.leader.rebalance.enable: true
# Producer level
acks: all
enable.idempotence: true
retries: 2147483647  # Integer.MAX_VALUE
# Consumer level
enable.auto.commit: false
isolation.level: read_committed
# isolation.level=read_committed reads only committed transactional
# messages, ignoring aborted ones.
Monitoring
Key metrics:
kafka.server:UnderReplicatedPartitions
kafka.server:IsrShrinksPerSec
kafka.server:ActiveControllerCount
kafka.server:OfflinePartitionsCount
kafka.server:LeaderElectionRate
Alerts:
- Under-replicated partitions > 0 → warning
- Offline partitions > 0 → critical
- ISR shrinks rate > threshold → warning
- Controller changed → investigate
- Leader election rate > threshold → warning
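The alert rules above are easy to express in code. A toy evaluator over the listed JMX metrics (metric names abbreviated, thresholds illustrative):

```python
# Toy alerting over the fault-tolerance metrics listed above.
def evaluate(metrics):
    alerts = []
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        alerts.append(("critical", "offline partitions"))       # data unavailable
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append(("warning", "under-replicated partitions"))  # RF not met
    if metrics.get("IsrShrinksPerSec", 0.0) > 1.0:
        alerts.append(("warning", "ISR shrinking"))             # followers lagging
    if metrics.get("ActiveControllerCount", 1) != 1:
        alerts.append(("critical", "controller count != 1"))    # split brain / none
    return alerts

print(evaluate({"UnderReplicatedPartitions": 3}))
# [('warning', 'under-replicated partitions')]
```

Under-replicated partitions are the early signal: they appear well before partitions go offline, which is why they get a warning while offline partitions page someone.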
Best Practices
✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ Cross-DC replication for disaster recovery
✅ Automatic leader rebalancer
✅ Monitor under-replicated partitions
❌ RF < 3 for production
❌ Single data center for critical data
❌ unclean.leader.election.enable=true
❌ min.insync.replicas=1 with acks=all
❌ Without monitoring ISR health
Architectural Decisions
- RF=3 + min.insync=2 — balance between availability and durability
- Cross-DC replication — for disaster recovery
- Unclean election = data loss — avoid in production
- Monitoring under-replicated partitions — early indicator of problems
Summary for Senior
- Fault tolerance is built on replication and ISR
- RF=3 tolerates failure of 2 brokers
- Cross-DC replication for disaster recovery
- Monitoring under-replicated partitions is critical
- Unclean leader election is a last-resort measure with data loss risk
🎯 Interview Cheat Sheet
Must know:
- Fault tolerance is built on replication (RF=3), ISR, and leader election
- RF=3 tolerates 2 broker failures; RF=2 — 1 broker failure
- min.insync.replicas=2 + acks=all — protects against writing to a single copy
- Producer: acks=all + enable.idempotence=true + retries=INT_MAX
- Consumer: committed offsets + consumer group auto-rebalancing on failover
- On a network partition: the side that keeps at least min.insync.replicas replicas in the ISR continues accepting writes; writes on the other side are rejected
- Cross-DC replication (MirrorMaker 2) — for disaster recovery
Common follow-up questions:
- What happens when one broker goes down with RF=3? — A follower becomes leader, data is not lost.
- How does the producer handle failover? — Retries + idempotence → duplicates are rejected by the broker.
- What happens on network partition? — The side retaining at least min.insync.replicas in the ISR continues; writes on the other side are rejected.
- How does a consumer recover? — Committed offsets → continues from the last saved offset.
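The consumer-recovery answer above in two lines of toy code (illustrative log and offset, not a Kafka client):

```python
# Toy consumer restart: with committed offsets, a restarted group member
# resumes at the last committed position, not from the beginning.
log = ["m0", "m1", "m2", "m3", "m4"]
committed = 3              # offset of the next record to read, saved pre-crash

resumed = log[committed:]  # what the consumer processes after restart
print(resumed)             # ['m3', 'm4']
```

With enable.auto.commit=false the offset is committed only after processing, so a crash between processing and commit means at-least-once redelivery of the last records, never silent loss.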
Red flags (DO NOT say):
- “RF=2 is enough for critical data” — min.insync=1 at RF=2 = data loss risk
- “Kafka never loses data” — it does with unclean leader election
- “Consumer must manually restore offsets” — automatically via consumer group
- “Cross-DC replication is synchronous” — it’s asynchronous, RPO > 0
Related topics:
- [[16. What is replication in Kafka]]
- [[18. What is ISR (In-Sync Replicas)]]
- [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]
- [[9. What delivery guarantees does Kafka provide]]