What is replication in Kafka

Junior Level

Definition

Replication is the copying of partition data across multiple brokers for fault tolerance.

Partition 0:
  Broker A → Leader (accepts writes)
  Broker B → Follower (copy)
  Broker C → Follower (copy)

Why is replication needed?

If the leader goes down:
  One of the followers becomes the new leader
  Data is not lost (assuming unclean.leader.election.enable=false and sync replicas exist).
  Consumers continue reading

Replication Factor

replication-factor=3 → 3 copies on different brokers
replication-factor=1 → no fault tolerance

Example: Creating a Topic

kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092

When a High Replication Factor is NOT Needed

Dev/test environments — RF=1 is acceptable
High-volume metrics/logs where data loss is tolerable — RF=1 or 2
Resource constraints — each replica = xN disk space

Middle Level

Leader and Followers

Partition 0:
  Broker 1 → Leader (read/write)
  Broker 2 → Follower (read-only, replicates from leader)
  Broker 3 → Follower (read-only, replicates from leader)

All writes go through the leader
Followers passively copy data

ISR (In-Sync Replicas)

ISR — a list of replicas that are synchronized with the leader

Example:
  Leader: Broker 1
  ISR: [Broker 1, Broker 2, Broker 3]

If Broker 3 falls behind:
  ISR: [Broker 1, Broker 2]

min.insync.replicas

Minimum number of replicas in ISR required for a write

min.insync.replicas=2:
  acks=all + at least 2 replicas in ISR
  If ISR < 2 → write is rejected

Failover

Leader goes down:
  Controller broker selects a new leader from ISR
  Automatic process
  Consumers and producers switch over

Common Mistakes

Too small replication factor:

replication-factor=1 → no fault tolerance
If the broker goes down, data is lost

min.insync.replicas=1 with acks=all:

No protection against data loss
Leader goes down → data lost

Senior Level

Leader Election

Controller Broker:

One of the brokers is the cluster controller
Responsible for:
- Selecting leaders for partitions
- Managing ISR
- Handling failover

If the controller goes down → a new controller is elected

Leader Election Process:

Leader goes down
Controller detects the failure
Selects a new leader from ISR
Updates metadata
Producers and consumers switch over

ISR Management

Expanding ISR:

Follower catches up with the leader → added to ISR

Shrinking ISR:

Follower falls behind → removed from ISR
Criterion: replica.lag.time.max.ms (30 seconds by default)

Cross-Datacenter Replication

MirrorMaker 2:

(available since Kafka 2.4, replaces the legacy MirrorMaker 1)
Replication between data centers
Active-passive or active-active topology
Asynchronous replication

Considerations:

- Network latency between DCs
- Bandwidth costs
- RPO (Recovery Point Objective)
- RTO (Recovery Time Objective)

Production Configuration

# Topic level
replication-factor: 3
min.insync.replicas: 2

# Broker level
unclean.leader.election.enable: false
replica.lag.time.max.ms: 30000

Best Practices

✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ unclean.leader.election.enable=false
✅ Even leader distribution
✅ Monitor ISR shrink/expansion

❌ RF=1 for production
❌ min.insync=1 with acks=all
❌ Unclean leader election
❌ Ignoring ISR shrink
❌ Single data center for critical data

Architectural Decisions

RF=3 — production standard
min.insync.replicas=2 — balance between availability and durability
Cross-DC replication — for disaster recovery
ISR monitoring — early indicator of problems

Summary for Senior

Replication is the foundation of Kafka fault tolerance
ISR is a critical mechanism for consistency
Leader election from ISR guarantees no data loss
Cross-DC replication for disaster recovery
ISR health monitoring is mandatory for production

🎯 Interview Cheat Sheet

Must know:

Replication = copying partition data to multiple brokers for fault tolerance
Replication Factor (RF): RF=3 is the production standard, tolerates up to 2 broker failures
Leader accepts read/write, Followers passively copy data
ISR (In-Sync Replicas) — replicas synchronized with the leader
min.insync.replicas=2 at RF=3: write is rejected if ISR < 2
unclean.leader.election.enable=false — leader only from ISR, otherwise data loss
When the leader goes down: Controller selects a new leader from ISR

Common follow-up questions:

What happens with RF=1? — No fault tolerance; if the broker goes down, data is lost.
Why min.insync.replicas=2? — Protection against writing to a single copy (leader down = data loss).
What is unclean leader election? — Leader from a non-ISR follower → data loss.
How does cross-DC replication work? — MirrorMaker 2: asynchronous replication between data centers.

Red flags (DO NOT say):

“RF=1 is enough for production” — no fault tolerance
“All replicas accept writes” — only the Leader; Followers are read-only
min.insync.replicas=1 with acks=all — no protection against data loss
“Unclean leader election is safe” — data is lost

Related topics:

[[17. What are leader and follower replicas]]
[[18. What is ISR (In-Sync Replicas)]]
[[19. How does Kafka ensure fault tolerance]]
[[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]