Question 18 · Section 15

What is ISR (In-Sync Replicas)

Junior Level

Definition

ISR (In-Sync Replicas): the set of partition replicas, including the leader, that are currently caught up with the leader.

Partition 0:
  Leader: Broker 1
  ISR: [Broker 1, Broker 2, Broker 3]

All three brokers have up-to-date data

Why is ISR needed?

1. Selecting a new leader on failover
   - Only from ISR, so the new leader already holds every committed record

2. Write acknowledgment
   - acks=all waits for acknowledgment from every replica currently in ISR

How does a replica enter ISR?

A replica is included in ISR if:
1. The broker is active and sends heartbeats
2. It has caught up with the leader within the last 30 seconds (replica.lag.time.max.ms, the default since Kafka 2.5)

Example

# Check ISR for a topic
kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092

Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3

The numbers 1,2,3 are broker IDs in the cluster. Isr: 1,2,3 means all three
replicas are currently in sync with the leader.
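
The same check can be scripted. The following is a minimal sketch, assuming the single-line "Key: value" layout of the output line shown above; other Kafka versions may print extra fields or different spacing. `parse_partition_line` and `is_under_replicated` are hypothetical helper names, not part of any Kafka tool.

```python
import re


def parse_partition_line(line: str) -> dict:
    """Parse one partition line of `kafka-topics.sh --describe` output.

    Assumes every field is printed as 'Key: value' with a non-empty value.
    Keys are lowercased so both 'Isr' and 'ISR' spellings are accepted.
    """
    fields = {key.lower(): value
              for key, value in re.findall(r"(\w+):\s*(\S+)", line)}
    return {
        "topic": fields["topic"],
        "partition": int(fields["partition"]),
        "leader": int(fields["leader"]),
        "replicas": [int(b) for b in fields["replicas"].split(",")],
        "isr": [int(b) for b in fields["isr"].split(",")],
    }


def is_under_replicated(info: dict) -> bool:
    """A partition is under-replicated when ISR is smaller than the replica list."""
    return len(info["isr"]) < len(info["replicas"])


info = parse_partition_line(
    "Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3"
)
print(info["isr"], is_under_replicated(info))
```

In practice the line would come from the command's standard output rather than from a literal string.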

Middle Level

Definition of “In-Sync”

A replica is considered in-sync if:

1. Liveness: Broker is active and maintains connection
2. Freshness: Not behind by more than replica.lag.time.max.ms (30 sec)

ISR and Leader Election

When the leader goes down:
  A new leader is selected only from ISR
  → Guarantees the new leader has all committed data
  → Records acknowledged with acks=all are not lost (records sent with acks=0 or acks=1 still can be)

ISR and Write Acknowledgment

acks=all + min.insync.replicas=2:
  Producer → Leader → waits for acknowledgment from every replica currently in ISR
  If ISR has fewer than 2 members → write is rejected

Shrinking and Expanding ISR

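Shrinking:
  Follower fails or stays behind the leader longer than replica.lag.time.max.ms → leader removes it from ISR

Expanding:
  Follower catches up with the leader again → leader adds it back to ISR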

min.insync.replicas

Minimum ISR size required for the leader to accept writes sent with acks=all

Example:
  min.insync.replicas=2
  ISR=[Leader, Follower1] → write is allowed
  ISR=[Leader] → write is rejected (NotEnoughReplicasException)
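
The rule in this example can be modeled in a few lines. This is a sketch of the decision only, not broker code: `NotEnoughReplicasError` is a stand-in for Kafka's NotEnoughReplicasException, and `check_write_allowed` is a hypothetical name.

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException."""


def check_write_allowed(isr: set, min_insync_replicas: int, acks: str) -> None:
    """Raise if a produce request must be rejected, otherwise return None.

    min.insync.replicas is enforced only for acks=all (also written acks=-1);
    with acks=0 or acks=1 the leader accepts the write regardless of ISR size.
    """
    if acks in ("all", "-1") and len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {len(isr)} is below min.insync.replicas={min_insync_replicas}"
        )


check_write_allowed({1, 2}, min_insync_replicas=2, acks="all")  # allowed
try:
    check_write_allowed({1}, min_insync_replicas=2, acks="all")
except NotEnoughReplicasError as err:
    print("rejected:", err)
```

Note that the same shrunken ISR would still accept an acks=1 write, which is why min.insync.replicas only protects producers that use acks=all.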

Common Mistakes

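  1. ISR Shrinking without investigation:
    Replicas leave ISR → fewer up-to-date copies, so one more failure can mean data loss or rejected writes
    → Need to monitor and fix the root cause

  2. min.insync.replicas=1:
    acks=all is satisfied by the leader alone → if that broker is lost before followers copy the record, acknowledged data is gone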

Senior Level

Internal Implementation

ISR Tracking:

The leader tracks each replica:
- Last Fetch Time — time of the last fetch request
- Last Caught Up Time — when it caught up with the leader
- Log End Offset — the latest offset on the replica

ISR inclusion criterion:
  (now - lastCaughtUpTime) <= replica.lag.time.max.ms
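
The criterion above can be sketched as a small model of the leader's bookkeeping. This is a simplification under stated assumptions: the real broker updates the caught-up time with somewhat more nuance, and `FollowerState`, `on_fetch` and `is_in_sync` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

REPLICA_LAG_TIME_MAX_MS = 30_000  # replica.lag.time.max.ms


@dataclass
class FollowerState:
    """Per-follower bookkeeping kept by the leader (simplified)."""
    broker_id: int
    log_end_offset: int = 0
    last_fetch_time_ms: int = 0
    last_caught_up_time_ms: int = 0

    def on_fetch(self, now_ms: int, follower_leo: int, leader_leo: int) -> None:
        # Every fetch refreshes the fetch time and the follower's known offset.
        self.last_fetch_time_ms = now_ms
        self.log_end_offset = follower_leo
        # Only a fetch that has reached the leader's log end counts as caught up.
        if follower_leo >= leader_leo:
            self.last_caught_up_time_ms = now_ms


def is_in_sync(state: FollowerState, now_ms: int,
               max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS) -> bool:
    """ISR inclusion criterion: (now - lastCaughtUpTime) <= replica.lag.time.max.ms."""
    return now_ms - state.last_caught_up_time_ms <= max_lag_ms


follower = FollowerState(broker_id=2)
follower.on_fetch(now_ms=1_000, follower_leo=100, leader_leo=100)
print(is_in_sync(follower, now_ms=31_000))  # still within the 30 s window
print(is_in_sync(follower, now_ms=31_001))  # window exceeded
```

The point of the time-based check is that a follower that keeps fetching but never reaches the leader's log end still drops out of ISR once the window expires.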

Historical Note:

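Before Kafka 0.9: replica.lag.max.messages (message count limit)
Since Kafka 0.9: replica.lag.time.max.ms (time-based limit)

Reason: a fixed message-count limit is hard to tune; a burst of traffic pushes healthy followers over it and causes spurious ISR shrinks, while a time-based limit does not depend on throughput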

ISR and acks=all

acks=all waits for acknowledgment from all ISR replicas

Scenario:
  ISR=[1,2,3], min.insync.replicas=2
  Producer → acks=all → waits for 1,2,3
  If 3 leaves ISR → ISR=[1,2] → still works

Problem scenario:
  ISR=[1,2,3], min.insync.replicas=2
  2 and 3 leave ISR → ISR=[1]
  Write is rejected (NotEnoughReplicasException)
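
To make the first scenario concrete: with acks=all a record is acknowledged once every current ISR member has it, so a slow replica delays acknowledgment only while it is still in ISR. The sketch below models that rule; the offsets are made-up illustration values and `is_acknowledged` is a hypothetical name.

```python
def is_acknowledged(record_offset: int, log_end_offsets: dict, isr: set) -> bool:
    """True when every ISR member has replicated the record.

    log_end_offsets maps broker id -> log end offset (the next offset the
    replica will write), so a replica holds the record once its log end
    offset is greater than the record's offset.
    """
    return all(log_end_offsets[broker] > record_offset for broker in isr)


# Brokers 1 and 2 hold the record at offset 100; broker 3 is lagging.
log_end_offsets = {1: 101, 2: 101, 3: 95}

print(is_acknowledged(100, log_end_offsets, isr={1, 2, 3}))  # waits for broker 3
print(is_acknowledged(100, log_end_offsets, isr={1, 2}))     # broker 3 left ISR
```

This check is separate from min.insync.replicas, which decides whether the write is accepted at all.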

Unclean Leader Election

If every replica in ISR is offline (the partition has no eligible leader):

unclean.leader.election.enable=false (default):
  → Wait for a replica from the last known ISR to recover
  → Partition is unavailable for reads and writes until then
  → Committed data is preserved

unclean.leader.election.enable=true:
  → Select any live replica, even one outside ISR
  → Partition becomes available again
  → Records the new leader had not replicated are lost
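
The two settings can be compared with a small model of the election decision. It is a sketch that assumes the controller prefers replicas in assignment order; `elect_leader` is a hypothetical function, not a Kafka API.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], isr: Set[int], live: Set[int],
                 unclean: bool = False) -> Optional[int]:
    """Pick a new leader for one partition, or None if it must stay offline."""
    # Clean election: first live replica (in assignment order) that is in ISR.
    for broker in replicas:
        if broker in isr and broker in live:
            return broker
    # Unclean election: fall back to any live replica, accepting data loss.
    if unclean:
        for broker in replicas:
            if broker in live:
                return broker
    return None


replicas = [1, 2, 3]
print(elect_leader(replicas, isr={1, 2}, live={2, 3}))              # clean: broker 2
print(elect_leader(replicas, isr={1}, live={2, 3}))                 # no leader
print(elect_leader(replicas, isr={1}, live={2, 3}, unclean=True))   # unclean: broker 2
```

In the last call broker 2 becomes leader even though it was outside ISR, which is exactly the case where unreplicated records disappear.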

Monitoring ISR

Key metrics:

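kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
kafka.cluster:type=Partition,name=InSyncReplicasCount,topic=<topic>,partition=<n> (ISR size)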

Alerts:

- ISR shrinks per sec > threshold → warning
- Under-replicated partitions > 0 (ISR size < replication factor) → warning
- ISR size < min.insync.replicas → critical (acks=all writes are rejected)
- Replica lag > 30s → critical
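
As an illustration of how such rules might be evaluated per partition, here is a sketch. The severities and the 30 s limit are assumptions to adapt to your own alerting policy, and `partition_alerts` is a hypothetical helper, not part of any monitoring tool.

```python
from typing import List


def partition_alerts(isr_size: int, replication_factor: int,
                     min_insync_replicas: int, max_follower_lag_ms: int,
                     lag_limit_ms: int = 30_000) -> List[str]:
    """Return alert messages for one partition (empty list means healthy)."""
    alerts = []
    if isr_size < min_insync_replicas:
        # acks=all producers are being rejected at this point.
        alerts.append("critical: ISR below min.insync.replicas")
    elif isr_size < replication_factor:
        # Still writable, but at least one replica is missing from ISR.
        alerts.append("warning: under-replicated partition")
    if max_follower_lag_ms > lag_limit_ms:
        alerts.append("critical: replica lag above limit")
    return alerts


print(partition_alerts(isr_size=2, replication_factor=3,
                       min_insync_replicas=2, max_follower_lag_ms=500))
```

The inputs would come from the metrics listed in this section; how they are collected is outside the scope of the sketch.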

ISR Troubleshooting

Causes of shrinking:

1. Network issues between brokers
2. Broker overload (CPU, disk, memory)
3. Slow disk on a follower
4. High data volume (follower can't keep up)

Solutions:

1. Check network connectivity
2. Monitor broker resources
3. Tune replication throughput: raise num.replica.fetchers, and replica.fetch.max.bytes if batches are large
4. Increase replica.lag.time.max.ms (temporary fix)

Production Configuration

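# Broker level (server.properties)
replica.lag.time.max.ms=30000
num.replica.fetchers=1
# Replication factor is fixed when a topic is created; this is the default for auto-created topics
default.replication.factor=3

# Topic level (can also be set on the broker as cluster-wide defaults)
min.insync.replicas=2
unclean.leader.election.enable=false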

Best Practices

✅ ISR monitoring (shrinks/expands)
✅ min.insync.replicas=2 at RF=3
✅ unclean.leader.election.enable=false
✅ Alert on under-replicated partitions
✅ Even replica distribution
✅ Monitor replica lag

❌ Ignoring ISR shrink
❌ min.insync.replicas=1 for production
❌ unclean.leader.election.enable=true
❌ No replica lag monitoring
❌ RF < 3 for production

Architectural Decisions

  1. ISR as the durability guarantee: the leader is elected only from ISR, so committed data survives failover
  2. min.insync.replicas=2 at RF=3: balance between availability and durability
  3. Monitoring ISR shrink: early indicator of problems
  4. Unclean election risks data loss: avoid in production
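
The availability/durability balance in point 2 comes down to simple arithmetic. The sketch below assumes producers use acks=all and that unclean leader election is disabled; `failure_budget` is a hypothetical name.

```python
def failure_budget(replication_factor: int, min_insync_replicas: int) -> dict:
    """How many broker failures each guarantee can absorb.

    write_availability: replicas that can drop out of ISR while acks=all
        writes are still accepted (RF - min.insync.replicas).
    acked_data_durability: replicas that can be lost without losing an
        acknowledged record, since it exists on at least
        min.insync.replicas copies (min.insync.replicas - 1).
    """
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return {
        "write_availability": replication_factor - min_insync_replicas,
        "acked_data_durability": min_insync_replicas - 1,
    }


for min_isr in (1, 2, 3):
    print(min_isr, failure_budget(3, min_isr))
```

At RF=3, a value of 1 gives no durability margin and a value of 3 gives no availability margin, which is why 2 is the usual compromise.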

Summary for Senior

  • ISR is a critical mechanism for Kafka consistency
  • A replica enters ISR based on time lag, not message count
  • ISR includes the Leader itself — the leader is always in ISR since it is the source of truth and doesn’t need to copy data.
  • min.insync.replicas protects against writing to a single copy
  • IsrShrinksPerSec is the key monitoring metric
  • Unclean leader election is a last-resort measure with data loss risk

🎯 Interview Cheat Sheet

Must know:

  • ISR — a list of replicas fully synchronized with the leader (including the leader itself)
  • A replica is in ISR if: active and not behind by more than replica.lag.time.max.ms (30s)
  • A new leader is selected ONLY from ISR → committed data is not lost on failover
  • acks=all waits for acknowledgment from all ISR replicas
  • min.insync.replicas=2 at RF=3: acks=all writes are rejected if ISR < 2
  • Shrinking: follower falls behind → removed; Expanding: catches up → returns
  • Before Kafka 0.9: message count limit; since 0.9: time-based limit (not tripped by traffic bursts, easier to tune)

Common follow-up questions:

  • Is the leader part of ISR? Yes, the leader is always in ISR: it is the source of truth.
  • What happens if all ISR replicas are down? With unclean.leader.election.enable=false the partition is unavailable until one of them returns; with true, any live replica can become leader (possible data loss).
  • Why time-based instead of message-based? A message-count limit drops healthy followers from ISR during traffic bursts and is hard to tune; a time limit does not depend on throughput.
  • What causes ISR shrinking? Network issues, broker overload, slow disk, high data volume.

Red flags (DO NOT say):

  • “ISR only includes followers”: the leader is always in ISR
  • “min.insync.replicas=1 is fine for production”: no protection against data loss
  • “ISR shrinking can be ignored”: it is an early indicator of problems
  • “Unclean leader election is safe”: it can discard acknowledged records

Related topics:

  • [[16. What is replication in Kafka]]
  • [[17. What are leader and follower replicas]]
  • [[19. How does Kafka ensure fault tolerance]]
  • [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]