Question 16 · Section 15

What is replication in Kafka

4. ISR monitoring — early indicator of problems

Language versions: English Russian Ukrainian

Junior Level

Definition

Replication is the copying of partition data across multiple brokers for fault tolerance.

Partition 0:
  Broker A → Leader (accepts writes)
  Broker B → Follower (copy)
  Broker C → Follower (copy)

Why is replication needed?

If the leader goes down:
  One of the followers becomes the new leader
  Data is not lost (assuming unclean.leader.election.enable=false and sync replicas exist).
  Consumers continue reading

Replication Factor

replication-factor=3 → 3 copies on different brokers
replication-factor=1 → no fault tolerance

Example: Creating a Topic

kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092

When a High Replication Factor is NOT Needed

  1. Dev/test environments — RF=1 is acceptable
  2. High-volume metrics/logs where data loss is tolerable — RF=1 or 2
  3. Resource constraints — each replica = xN disk space

Middle Level

Leader and Followers

Partition 0:
  Broker 1 → Leader (read/write)
  Broker 2 → Follower (read-only, replicates from leader)
  Broker 3 → Follower (read-only, replicates from leader)

All writes go through the leader
Followers passively copy data

ISR (In-Sync Replicas)

ISR — a list of replicas that are synchronized with the leader

Example:
  Leader: Broker 1
  ISR: [Broker 1, Broker 2, Broker 3]

If Broker 3 falls behind:
  ISR: [Broker 1, Broker 2]

min.insync.replicas

Minimum number of replicas in ISR required for a write

min.insync.replicas=2:
  acks=all + at least 2 replicas in ISR
  If ISR < 2 → write is rejected

Failover

Leader goes down:
  Controller broker selects a new leader from ISR
  Automatic process
  Consumers and producers switch over

Common Mistakes

  1. Too small replication factor:
    replication-factor=1 → no fault tolerance
    If the broker goes down, data is lost
    
  2. min.insync.replicas=1 with acks=all:
    No protection against data loss
    Leader goes down → data lost
    

Senior Level

Leader Election

Controller Broker:

One of the brokers is the cluster controller
Responsible for:
- Selecting leaders for partitions
- Managing ISR
- Handling failover

If the controller goes down → a new controller is elected

Leader Election Process:

1. Leader goes down
2. Controller detects the failure
3. Selects a new leader from ISR
4. Updates metadata
5. Producers and consumers switch over

ISR Management

Expanding ISR:

Follower catches up with the leader → added to ISR

Shrinking ISR:

Follower falls behind → removed from ISR
Criterion: replica.lag.time.max.ms (30 seconds by default)

Cross-Datacenter Replication

MirrorMaker 2:

(available since Kafka 2.4, replaces the legacy MirrorMaker 1)
Replication between data centers
Active-passive or active-active topology
Asynchronous replication

Considerations:

- Network latency between DCs
- Bandwidth costs
- RPO (Recovery Point Objective)
- RTO (Recovery Time Objective)

Production Configuration

# Topic level
replication-factor: 3
min.insync.replicas: 2

# Broker level
unclean.leader.election.enable: false
replica.lag.time.max.ms: 30000

Best Practices

✅ RF=3 for production
✅ min.insync.replicas=2
✅ acks=all for critical data
✅ unclean.leader.election.enable=false
✅ Even leader distribution
✅ Monitor ISR shrink/expansion

❌ RF=1 for production
❌ min.insync=1 with acks=all
❌ Unclean leader election
❌ Ignoring ISR shrink
❌ Single data center for critical data

Architectural Decisions

  1. RF=3 — production standard
  2. min.insync.replicas=2 — balance between availability and durability
  3. Cross-DC replication — for disaster recovery
  4. ISR monitoring — early indicator of problems

Summary for Senior

  • Replication is the foundation of Kafka fault tolerance
  • ISR is a critical mechanism for consistency
  • Leader election from ISR guarantees no data loss
  • Cross-DC replication for disaster recovery
  • ISR health monitoring is mandatory for production

🎯 Interview Cheat Sheet

Must know:

  • Replication = copying partition data to multiple brokers for fault tolerance
  • Replication Factor (RF): RF=3 is the production standard, tolerates up to 2 broker failures
  • Leader accepts read/write, Followers passively copy data
  • ISR (In-Sync Replicas) — replicas synchronized with the leader
  • min.insync.replicas=2 at RF=3: write is rejected if ISR < 2
  • unclean.leader.election.enable=false — leader only from ISR, otherwise data loss
  • When the leader goes down: Controller selects a new leader from ISR

Common follow-up questions:

  • What happens with RF=1? — No fault tolerance; if the broker goes down, data is lost.
  • Why min.insync.replicas=2? — Protection against writing to a single copy (leader down = data loss).
  • What is unclean leader election? — Leader from a non-ISR follower → data loss.
  • How does cross-DC replication work? — MirrorMaker 2: asynchronous replication between data centers.

Red flags (DO NOT say):

  • “RF=1 is enough for production” — no fault tolerance
  • “All replicas accept writes” — only the Leader; Followers are read-only
  • min.insync.replicas=1 with acks=all — no protection against data loss
  • “Unclean leader election is safe” — data is lost

Related topics:

  • [[17. What are leader and follower replicas]]
  • [[18. What is ISR (In-Sync Replicas)]]
  • [[19. How does Kafka ensure fault tolerance]]
  • [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]