Question 19 · Section 15

How does Kafka ensure fault tolerance?

Junior Level

Key Mechanisms

Kafka ensures fault tolerance through replication:

1. Replication — data on multiple brokers
2. Leader election — automatic selection of a new leader
3. Consumer group — consumers restart and continue

Failover Example

Broker A (Leader) → goes down
Broker B (Follower) → becomes Leader
Consumers → continue reading from the new leader
Producers → continue writing to the new leader

Replication Factor

replication-factor=3 → tolerates failure of 2 brokers (assuming replicas are
distributed across different brokers, which happens by default with enough
brokers).
replication-factor=2 → tolerates failure of 1 broker
replication-factor=1 → no fault tolerance
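The arithmetic above can be sketched directly. A toy illustration (not a Kafka API): data survives up to RF−1 broker failures, while acks=all writes keep flowing only while the ISR stays at or above min.insync.replicas, i.e. through RF−min.insync failures.

```python
# Toy arithmetic for Kafka's replication guarantees (illustration only).

def durability_tolerance(rf):
    """Broker failures the data itself survives: one replica must remain."""
    return rf - 1

def write_availability_tolerance(rf, min_insync):
    """Broker failures after which acks=all writes are still accepted."""
    return rf - min_insync

print(durability_tolerance(3))             # 2: rf=3 keeps the data through 2 failures
print(write_availability_tolerance(3, 2))  # 1: with min.insync=2, writes survive 1 failure
```

This is why RF=3 with min.insync.replicas=2 "tolerates 2 failures" only in the durability sense: after the second failure reads of existing data are still possible, but new writes are rejected.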

Creating a Fault-Tolerant Topic

kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092

Middle Level

Fault Tolerance at Different Levels

Broker:

RF=3 → data on 3 brokers
Tolerates failure of 2 brokers

Producer:

acks=all + retries → automatic retry
enable.idempotence → duplicate protection

Consumer:

Committed offsets → continues after restart
Consumer group → automatic rebalancing
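The producer and consumer settings above can be collected into client configs. A minimal sketch (the keys are standard Kafka client properties; group name and timeout values are illustrative):

```properties
# Producer: survive broker failover without losing or duplicating messages
acks=all
enable.idempotence=true
retries=2147483647
delivery.timeout.ms=120000

# Consumer: resume from committed offsets after a restart
group.id=orders-processing
enable.auto.commit=false
auto.offset.reset=earliest
```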

Leader Election

1. Leader goes down
2. Controller detects the failure
3. Selects a new leader from ISR
4. Producers and consumers switch over
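The steps above can be sketched as a toy model of the controller's decision (broker names and the selection order are invented for illustration; the real controller runs inside the cluster and has its own preference rules):

```python
# Toy model of leader failover: the new leader must come from the ISR.

def elect_leader(isr, failed):
    """Pick a new leader after `failed` goes down; None if no ISR member remains."""
    candidates = [b for b in isr if b != failed]
    return candidates[0] if candidates else None

print(elect_leader(["broker-a", "broker-b", "broker-c"], "broker-a"))  # broker-b
print(elect_leader(["broker-a"], "broker-a"))  # None: only unclean election could proceed
```

The second case is the dangerous one: when the ISR is empty, the partition stays offline unless unclean.leader.election.enable=true, which trades data loss for availability.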

min.insync.replicas

min.insync.replicas=2:
  Writes are only allowed if >= 2 replicas are in ISR
  Protects against writing to a single copy
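A minimal sketch of that gate (illustrative, not broker code): an acks=all write is accepted only while the ISR size stays at or above min.insync.replicas.

```python
# Toy check mirroring the broker's min.insync.replicas gate for acks=all writes.

def write_allowed(isr_size, min_insync=2):
    return isr_size >= min_insync

print(write_allowed(3))  # True: full ISR
print(write_allowed(2))  # True: one replica lost, still >= 2
print(write_allowed(1))  # False: the broker answers with NotEnoughReplicas
```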

Common Mistakes

  1. Insufficient replication:
    RF=2, min.insync=1 → data loss on leader failure
    
  2. Unclean leader election:
    unclean.leader.election.enable=true → risk of data loss
    

Senior Level

Cross-Datacenter Replication

MirrorMaker 2:

Replication between data centers
Active-passive or active-active topology
Asynchronous replication

Topology options:

1. Active-Passive:
   DC1 (active) → DC2 (passive)
   If DC1 goes down → DC2 becomes active

2. Active-Active:
   DC1 ←→ DC2
   Both DCs accept writes
   More complex consistency
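An active-passive setup can be described in a MirrorMaker 2 properties file. A minimal sketch (cluster aliases and bootstrap addresses are placeholders):

```properties
# Two clusters: dc1 replicates into dc2 (active-passive)
clusters = dc1, dc2
dc1.bootstrap.servers = dc1-kafka:9092
dc2.bootstrap.servers = dc2-kafka:9092

# Enable one-way replication dc1 -> dc2
dc1->dc2.enabled = true
dc1->dc2.topics = .*

# Keep replicated topics as fault tolerant as the originals
replication.factor = 3
```

For active-active, the reverse flow (dc2->dc1.enabled = true) is added as well; MM2 prefixes remote topics with the source alias to avoid replication loops.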

Failure Scenarios

1. Single Broker Failure:

RF=3, min.insync=2:
  Broker A (Leader) goes down
  ISR=[B,C] → new leader from B or C
  Writes continue (ISR >= 2)
  Data is not lost

2. Network Partition:

Brokers split into two groups:
  Group 1: Broker A, B
  Group 2: Broker C

Group 1: ISR=[A,B] → works
Group 2: ISR=[C] → writes are rejected (ISR < min.insync)

3. Datacenter Failure:

Active-Passive:
  DC1 goes down → MirrorMaker stops
  DC2 is activated → continues operation
  RPO = data replicated before the failure
  RTO = time to activate DC2

Recovery Procedures

1. Broker Recovery:

Failed broker restarts:
  → Connects to the cluster
  → Starts replicating from the leader
  → When it catches up → added back to ISR
  → Can become leader on next failover

2. Datacenter Recovery:

DC1 is restored:
  → MirrorMaker resumes replication
  → Syncs data between DCs
  → Switch back or continue in DC2

Production Configuration

# Topic level
replication-factor: 3
min.insync.replicas: 2
unclean.leader.election.enable: false

# Broker level
replica.lag.time.max.ms: 30000
auto.leader.rebalance.enable: true

# Producer level
acks: all
enable.idempotence: true
retries: Integer.MAX_VALUE

# Consumer level
enable.auto.commit: false
isolation.level: read_committed
# isolation.level=read_committed reads only committed transactional
# messages, ignoring aborted ones.

Monitoring

Key metrics:

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.controller:type=KafkaController,name=ActiveControllerCount
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Alerts:

- Under-replicated partitions > 0 → warning
- Offline partitions > 0 → critical
- ISR shrinks rate > threshold → warning
- Controller changed → investigate
- Leader election rate > threshold → warning
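The alert rules above can be expressed as a small evaluation function. A sketch with illustrative thresholds and severity names (any real setup would live in the monitoring system, not application code):

```python
# Toy mapping from metric values to alert severity, mirroring the rules above.

def evaluate(metrics):
    alerts = []
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        alerts.append(("critical", "offline partitions"))
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append(("warning", "under-replicated partitions"))
    if metrics.get("IsrShrinksPerSec", 0.0) > 1.0:  # threshold is illustrative
        alerts.append(("warning", "ISR shrinking"))
    return alerts

print(evaluate({"UnderReplicatedPartitions": 2, "OfflinePartitionsCount": 0}))
```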

Best Practices

✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ Cross-DC replication for disaster recovery
✅ Automatic leader rebalancer
✅ Monitor under-replicated partitions

❌ RF < 3 for production
❌ Single data center for critical data
❌ unclean.leader.election.enable=true
❌ min.insync.replicas=1 with acks=all
❌ Without monitoring ISR health

Architectural Decisions

  1. RF=3 + min.insync=2 — balance between availability and durability
  2. Cross-DC replication — for disaster recovery
  3. Unclean election = data loss — avoid in production
  4. Monitoring under-replicated partitions — early indicator of problems

Summary for Senior

  • Fault tolerance is built on replication and ISR
  • RF=3 tolerates failure of 2 brokers
  • Cross-DC replication for disaster recovery
  • Monitoring under-replicated partitions is critical
  • Unclean leader election is a last-resort measure with data loss risk

🎯 Interview Cheat Sheet

Must know:

  • Fault tolerance is built on replication (RF=3), ISR, and leader election
  • RF=3 tolerates 2 broker failures; RF=2 — 1 broker failure
  • min.insync.replicas=2 + acks=all — protects against writing to a single copy
  • Producer: acks=all + enable.idempotence=true + retries=INT_MAX
  • Consumer: committed offsets + consumer group auto-rebalancing on failover
  • On network partition: partitions whose ISR still meets min.insync.replicas keep accepting writes; on the other side writes are rejected
  • Cross-DC replication (MirrorMaker 2) — for disaster recovery

Common follow-up questions:

  • What happens when one broker goes down with RF=3? — A follower becomes leader, data is not lost.
  • How does the producer handle failover? — Retries + idempotence → duplicates are rejected by the broker.
  • What happens on network partition? — Partitions whose ISR still meets min.insync.replicas continue; on the isolated side writes are rejected.
  • How does a consumer recover? — Committed offsets → continues from the last saved offset.

Red flags (DO NOT say):

  • “RF=2 is enough for critical data” — min.insync=1 at RF=2 = data loss risk
  • “Kafka never loses data” — it does with unclean leader election
  • “Consumer must manually restore offsets” — automatically via consumer group
  • “Cross-DC replication is synchronous” — it’s asynchronous, RPO > 0

Related topics:

  • [[16. What is replication in Kafka]]
  • [[18. What is ISR (In-Sync Replicas)]]
  • [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]
  • [[9. What delivery guarantees does Kafka provide]]