Question 19 · Section 15

How does Kafka ensure fault tolerance?

Junior Level

Key Mechanisms

Kafka ensures fault tolerance through replication:

1. Replication — data on multiple brokers
2. Leader election — automatic selection of a new leader
3. Consumer group — consumers restart and continue

Failover Example

Broker A (Leader) → goes down
Broker B (Follower) → becomes Leader
Consumers → continue reading from the new leader
Producers → continue writing to the new leader

Replication Factor

replication-factor=3 → tolerates failure of 2 brokers (assuming replicas are
distributed across different brokers, which happens by default with enough
brokers).
replication-factor=2 → tolerates failure of 1 broker
replication-factor=1 → no fault tolerance
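The arithmetic above can be sketched directly. A toy illustration (not a Kafka API): data survives up to RF−1 broker failures, while acks=all writes keep flowing only while the ISR stays at or above min.insync.replicas, i.e. through RF−min.insync failures.

```python
# Toy arithmetic for Kafka's replication guarantees (illustration only).

def durability_tolerance(rf):
    """Broker failures the data itself survives: one replica must remain."""
    return rf - 1

def write_availability_tolerance(rf, min_insync):
    """Broker failures after which acks=all writes are still accepted."""
    return rf - min_insync

print(durability_tolerance(3))             # 2: rf=3 keeps the data through 2 failures
print(write_availability_tolerance(3, 2))  # 1: with min.insync=2, writes survive 1 failure
```

This is why RF=3 with min.insync.replicas=2 "tolerates 2 failures" only in the durability sense: after the second failure reads of existing data are still possible, but new writes are rejected.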

Creating a Fault-Tolerant Topic

kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092

Middle Level

Fault Tolerance at Different Levels

Broker:

RF=3 → data on 3 brokers
Tolerates failure of 2 brokers

Producer:

acks=all + retries → automatic retry
enable.idempotence → duplicate protection

Consumer:

Committed offsets → continues after restart
Consumer group → automatic rebalancing
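The producer and consumer settings above can be collected into client configs. A minimal sketch (the keys are standard Kafka client properties; group name and timeout values are illustrative):

```properties
# Producer: survive broker failover without losing or duplicating messages
acks=all
enable.idempotence=true
retries=2147483647
delivery.timeout.ms=120000

# Consumer: resume from committed offsets after a restart
group.id=orders-processing
enable.auto.commit=false
auto.offset.reset=earliest
```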

Leader Election

1. Leader goes down
2. Controller detects the failure
3. Selects a new leader from ISR
4. Producers and consumers switch over
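The steps above can be sketched as a toy model of the controller's decision (broker names and the selection order are invented for illustration; the real controller runs inside the cluster and has its own preference rules):

```python
# Toy model of leader failover: the new leader must come from the ISR.

def elect_leader(isr, failed):
    """Pick a new leader after `failed` goes down; None if no ISR member remains."""
    candidates = [b for b in isr if b != failed]
    return candidates[0] if candidates else None

print(elect_leader(["broker-a", "broker-b", "broker-c"], "broker-a"))  # broker-b
print(elect_leader(["broker-a"], "broker-a"))  # None: only unclean election could proceed
```

The second case is the dangerous one: when the ISR is empty, the partition stays offline unless unclean.leader.election.enable=true, which trades data loss for availability.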

min.insync.replicas

min.insync.replicas=2:
  Writes are only allowed if >= 2 replicas are in ISR
  Protects against writing to a single copy
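A minimal sketch of that gate (illustrative, not broker code): an acks=all write is accepted only while the ISR size stays at or above min.insync.replicas.

```python
# Toy check mirroring the broker's min.insync.replicas gate for acks=all writes.

def write_allowed(isr_size, min_insync=2):
    return isr_size >= min_insync

print(write_allowed(3))  # True: full ISR
print(write_allowed(2))  # True: one replica lost, still >= 2
print(write_allowed(1))  # False: the broker answers with NotEnoughReplicas
```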

Common Mistakes

  1. Insufficient replication:
    RF=2, min.insync=1 → data loss on leader failure
    
  2. Unclean leader election:
    unclean.leader.election.enable=true → risk of data loss
    

Senior Level

Cross-Datacenter Replication

MirrorMaker 2:

Replication between data centers
Active-passive or active-active topology
Asynchronous replication

Topology options:

1. Active-Passive:
   DC1 (active) → DC2 (passive)
   If DC1 goes down → DC2 becomes active

2. Active-Active:
   DC1 ←→ DC2
   Both DCs accept writes
   More complex consistency
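An active-passive setup can be described in a MirrorMaker 2 properties file. A minimal sketch (cluster aliases and bootstrap addresses are placeholders):

```properties
# Two clusters: dc1 replicates into dc2 (active-passive)
clusters = dc1, dc2
dc1.bootstrap.servers = dc1-kafka:9092
dc2.bootstrap.servers = dc2-kafka:9092

# Enable one-way replication dc1 -> dc2
dc1->dc2.enabled = true
dc1->dc2.topics = .*

# Keep replicated topics as fault tolerant as the originals
replication.factor = 3
```

For active-active, the reverse flow (dc2->dc1.enabled = true) is added as well; MM2 prefixes remote topics with the source alias to avoid replication loops.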

Failure Scenarios

1. Single Broker Failure:

RF=3, min.insync=2:
  Broker A (Leader) goes down
  ISR=[B,C] → new leader from B or C
  Writes continue (ISR >= 2)
  Data is not lost

2. Network Partition:

Brokers split into two groups:
  Group 1: Broker A, B
  Group 2: Broker C

Group 1: ISR=[A,B] → works
Group 2: ISR=[C] → writes are rejected (ISR < min.insync)

3. Datacenter Failure:

Active-Passive:
  DC1 goes down → MirrorMaker stops
  DC2 is activated → continues operation
  RPO = data replicated before the failure
  RTO = time to activate DC2

Recovery Procedures

1. Broker Recovery:

Failed broker restarts:
  → Connects to the cluster
  → Starts replicating from the leader
  → When it catches up → added back to ISR
  → Can become leader on next failover

2. Datacenter Recovery:

DC1 is restored:
  → MirrorMaker resumes replication
  → Syncs data between DCs
  → Switch back or continue in DC2

Production Configuration

# Topic level
replication-factor: 3
min.insync.replicas: 2
unclean.leader.election.enable: false

# Broker level
replica.lag.time.max.ms: 30000
auto.leader.rebalance.enable: true

# Producer level
acks: all
enable.idempotence: true
retries: Integer.MAX_VALUE

# Consumer level
enable.auto.commit: false
isolation.level: read_committed
# isolation.level=read_committed reads only committed transactional
# messages, ignoring aborted ones.

Monitoring

Key metrics:

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.controller:type=KafkaController,name=ActiveControllerCount
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Alerts:

- Under-replicated partitions > 0 → warning
- Offline partitions > 0 → critical
- ISR shrinks rate > threshold → warning
- Controller changed → investigate
- Leader election rate > threshold → warning
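The alert rules above can be expressed as a small evaluation function. A sketch with illustrative thresholds and severity names (any real setup would live in the monitoring system, not application code):

```python
# Toy mapping from metric values to alert severity, mirroring the rules above.

def evaluate(metrics):
    alerts = []
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        alerts.append(("critical", "offline partitions"))
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append(("warning", "under-replicated partitions"))
    if metrics.get("IsrShrinksPerSec", 0.0) > 1.0:  # threshold is illustrative
        alerts.append(("warning", "ISR shrinking"))
    return alerts

print(evaluate({"UnderReplicatedPartitions": 2, "OfflinePartitionsCount": 0}))
```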

Best Practices

✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ Cross-DC replication for disaster recovery
✅ Automatic leader rebalancer
✅ Monitor under-replicated partitions

❌ RF < 3 for production
❌ Single data center for critical data
❌ unclean.leader.election.enable=true
❌ min.insync.replicas=1 with acks=all
❌ Without monitoring ISR health

Architectural Decisions

  1. RF=3 + min.insync=2 — balance between availability and durability
  2. Cross-DC replication — for disaster recovery
  3. Unclean election = data loss — avoid in production
  4. Monitoring under-replicated partitions — early indicator of problems

Summary for Senior

  • Fault tolerance is built on replication and ISR
  • RF=3 tolerates failure of 2 brokers
  • Cross-DC replication for disaster recovery
  • Monitoring under-replicated partitions is critical
  • Unclean leader election is a last-resort measure with data loss risk

🎯 Interview Cheat Sheet

Must know:

  • Fault tolerance is built on replication (RF=3), ISR, and leader election
  • RF=3 tolerates 2 broker failures; RF=2 — 1 broker failure
  • min.insync.replicas=2 + acks=all — protects against writing to a single copy
  • Producer: acks=all + enable.idempotence=true + retries=INT_MAX
  • Consumer: committed offsets + consumer group auto-rebalancing on failover
  • On network partition: partitions whose ISR still meets min.insync.replicas keep accepting writes; on the other side writes are rejected
  • Cross-DC replication (MirrorMaker 2) — for disaster recovery

Common follow-up questions:

  • What happens when one broker goes down with RF=3? — A follower becomes leader, data is not lost.
  • How does the producer handle failover? — Retries + idempotence → duplicates are rejected by the broker.
  • What happens on network partition? — Partitions whose ISR still meets min.insync.replicas continue; on the isolated side writes are rejected.
  • How does a consumer recover? — Committed offsets → continues from the last saved offset.

Red flags (DO NOT say):

  • “RF=2 is enough for critical data” — min.insync=1 at RF=2 = data loss risk
  • “Kafka never loses data” — it does with unclean leader election
  • “Consumer must manually restore offsets” — automatically via consumer group
  • “Cross-DC replication is synchronous” — it’s asynchronous, RPO > 0

Related topics:

  • [[16. What is replication in Kafka]]
  • [[18. What is ISR (In-Sync Replicas)]]
  • [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]
  • [[9. What delivery guarantees does Kafka provide]]