What is ISR (In-Sync Replicas)
Junior Level
Definition
ISR (In-Sync Replicas): the set of a partition's replicas, the leader included, that are alive and have caught up with the leader within the allowed lag.
Partition 0:
Leader: Broker 1
ISR: [Broker 1, Broker 2, Broker 3]
All three brokers have up-to-date data
Why is ISR needed?
1. Selecting a new leader on failover
   Only ISR members are eligible, so committed data is not lost
2. Write acknowledgment
   acks=all waits for acknowledgment from every replica in the ISR
How does a replica enter ISR?
A replica is included in ISR if:
1. The broker is active and sends heartbeats
2. It is not behind the leader by more than 30 seconds (by default)
Example
# Check ISR for a topic
kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 ISR: 1,2,3
The numbers 1,2,3 are broker IDs in the cluster. ISR: 1,2,3 means all three
brokers are synchronized.
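To spot lagging replicas automatically, the per-partition line above can be parsed and the Replicas list compared with the ISR list. This is a minimal sketch: the function name `parse_describe_line` is hypothetical, and it assumes the single-line "Key: value" layout shown in the example (it does not handle the topic summary line).

```python
import re

def parse_describe_line(line):
    """Parse one partition line of `kafka-topics.sh --describe` output."""
    # Collect "Key: value" pairs; keys are lower-cased because the label
    # may be printed as "ISR" or "Isr" depending on the source.
    fields = {k.lower(): v for k, v in re.findall(r"(\w+):\s*(\S+)", line)}

    def to_ids(csv):
        return [int(x) for x in csv.split(",") if x]

    replicas = to_ids(fields["replicas"])
    isr = to_ids(fields["isr"])
    return {
        "topic": fields["topic"],
        "partition": int(fields["partition"]),
        "leader": int(fields["leader"]),
        "replicas": replicas,
        "isr": isr,
        # Assigned replicas that are currently not in sync.
        "out_of_sync": [b for b in replicas if b not in isr],
    }

line = "Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 ISR: 1,3"
info = parse_describe_line(line)
print(info["out_of_sync"])  # [2]
```

An empty `out_of_sync` list corresponds to the healthy case in the example, where ISR equals Replicas.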
Middle Level
Definition of “In-Sync”
A replica is considered in-sync if:
1. Liveness: Broker is active and maintains connection
2. Freshness: Not behind by more than replica.lag.time.max.ms (default 30 sec since Kafka 2.5, 10 sec before)
ISR and Leader Election
When the leader goes down:
A new leader is selected only from ISR
→ Guarantees the new leader has all committed data
→ Committed data (acknowledged with acks=all) is not lost; records acknowledged with acks=0 or acks=1 may still be
ISR and Write Acknowledgment
acks=all + min.insync.replicas=2:
Producer → Leader → waits for acknowledgment from all ISR
If ISR < 2 → write is rejected
Shrinking and Expanding ISR
Shrinking:
Follower slows down → removed from ISR
Expanding:
Follower catches up → returns to ISR
min.insync.replicas
Minimum acceptable ISR size for accepting writes
Example:
min.insync.replicas=2
ISR=[Leader, Follower1] → write is allowed
ISR=[Leader] → write is rejected (NotEnoughReplicasException)
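The rule in this example can be sketched as a small check. This is an illustrative model, not broker code: `accept_write` and `NotEnoughReplicasError` are hypothetical names, and the sketch relies on the fact that min.insync.replicas is enforced only for producers using acks=all.

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException (hypothetical class)."""

def accept_write(isr, min_insync_replicas, acks="all"):
    """Return True if the leader would accept the write, else raise."""
    # The ISR size is compared with min.insync.replicas only for acks=all.
    if acks == "all" and len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {len(isr)} < min.insync.replicas {min_insync_replicas}"
        )
    return True

print(accept_write(["leader", "follower1"], 2))  # True

try:
    accept_write(["leader"], 2)
except NotEnoughReplicasError as err:
    print("rejected:", err)
```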
Common Mistakes
- ISR Shrinking without investigation:
Replicas leave ISR → risk of data loss → Need to monitor and fix the root cause
- min.insync.replicas=1:
No protection — a single leader can lose data
Senior Level
Internal Implementation
ISR Tracking:
The leader tracks each replica:
- Last Fetch Time — time of the last fetch request
- Last Caught Up Time — when it caught up with the leader
- Log End Offset — the latest offset on the replica
ISR inclusion criterion:
(now - lastCaughtUpTime) <= replica.lag.time.max.ms
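The criterion above can be expressed directly in code. The sketch below is a simplified model of the leader's periodic check, with hypothetical function names and timestamps in milliseconds; the real broker logic tracks more state than this.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # replica.lag.time.max.ms

def is_in_sync(now_ms, last_caught_up_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Time-based criterion: (now - lastCaughtUpTime) <= replica.lag.time.max.ms."""
    return (now_ms - last_caught_up_ms) <= max_lag_ms

def shrink_isr(isr, last_caught_up, leader_id, now_ms):
    """Keep the leader plus every follower that still meets the criterion."""
    return [
        broker for broker in isr
        if broker == leader_id or is_in_sync(now_ms, last_caught_up[broker])
    ]

# Broker 3 last caught up 40 s ago, so it drops out of the ISR.
last_caught_up = {1: 100_000, 2: 95_000, 3: 60_000}
print(shrink_isr([1, 2, 3], last_caught_up, leader_id=1, now_ms=100_000))  # [1, 2]
```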
Historical Note:
Before Kafka 0.9: replica.lag.max.messages (message count limit)
After Kafka 0.9: replica.lag.time.max.ms (time-based limit)
Reason: a fixed message-count limit is hard to tune; under bursty or high-throughput traffic, healthy followers briefly exceed it and flap in and out of ISR
ISR and acks=all
acks=all waits for acknowledgment from all ISR replicas
Scenario:
ISR=[1,2,3], min.insync.replicas=2
Producer → acks=all → waits for 1,2,3
If 3 leaves ISR → ISR=[1,2] → still works
Problem scenario:
ISR=[1,2,3], min.insync.replicas=2
2 and 3 leave ISR → ISR=[1]
Write is rejected (NotEnoughReplicasException)
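One way to see why the ISR membership matters for acks=all: a record counts as committed once every ISR member has it, so the committed position (the high watermark) is the smallest log end offset across the ISR. The sketch below illustrates this with made-up offsets; `high_watermark` is a hypothetical helper, not a Kafka API.

```python
def high_watermark(log_end_offsets, isr):
    """Committed position: the smallest log end offset among ISR members."""
    return min(log_end_offsets[broker] for broker in isr)

# Hypothetical log end offsets per broker; broker 3 is lagging.
leo = {1: 100, 2: 100, 3: 90}

print(high_watermark(leo, [1, 2, 3]))  # 90: acks=all waits for broker 3
print(high_watermark(leo, [1, 2]))     # 100: broker 3 left the ISR
```

In this model a slow follower holds back acknowledgments only while it is still in the ISR; once it is removed, the remaining members determine what is committed.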
Unclean Leader Election
If no ISR member is available (all in-sync replicas are down):
unclean.leader.election.enable=false (default):
→ Wait for any replica from ISR to recover
→ System is unavailable for writes
→ Data is preserved
unclean.leader.election.enable=true:
→ Select any live follower
→ System becomes available
→ Data that wasn't replicated is lost
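The two behaviours can be compared in a small simulation. This is a simplified sketch with hypothetical names: it picks the first eligible broker in replica-assignment order and returns a flag indicating whether the choice may lose data.

```python
def elect_leader(replicas, isr, live_brokers, unclean=False):
    """Return (new_leader, may_lose_data); new_leader is None if unavailable."""
    # Clean election: only live ISR members are eligible.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker, False
    # Unclean election: fall back to any live replica, accepting data loss.
    if unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker, True
    return None, False

# The only ISR member (broker 1) is down; brokers 2 and 3 are alive but stale.
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}))                # (None, False)
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}, unclean=True))  # (2, True)
```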
Monitoring ISR
Key metrics:
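kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.cluster:type=Partition,name=InSyncReplicasCount (ISR size per partition)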
Alerts:
- ISR shrinks per sec > threshold → warning
- Under-replicated partitions > 0 → warning
- ISR size < min.insync.replicas → critical (acks=all writes are rejected)
- Replica lag > 30s → critical
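Alert rules of this kind could be evaluated as in the sketch below. Everything here is an assumption for illustration: the function name, the parameter names, and the default thresholds are hypothetical, and the sketch treats an ISR smaller than min.insync.replicas as the critical ISR-size condition.

```python
def isr_alerts(shrinks_per_sec, under_replicated, isr_size, min_insync_replicas,
               replica_lag_ms, shrink_threshold=0.0, lag_limit_ms=30_000):
    """Return a list of (severity, message) tuples for one partition snapshot."""
    alerts = []
    if shrinks_per_sec > shrink_threshold:
        alerts.append(("warning", "ISR shrink rate above threshold"))
    if under_replicated > 0:
        alerts.append(("warning", "under-replicated partitions"))
    if isr_size < min_insync_replicas:
        alerts.append(("critical", "ISR below min.insync.replicas"))
    if replica_lag_ms > lag_limit_ms:
        alerts.append(("critical", "replica lag above limit"))
    return alerts

for severity, message in isr_alerts(0.5, 2, 1, 2, 45_000):
    print(severity, message)
```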
ISR Troubleshooting
Causes of shrinking:
1. Network issues between brokers
2. Broker overload (CPU, disk, memory)
3. Slow disk on a follower
4. High data volume (follower can't keep up)
Solutions:
1. Check network connectivity
2. Monitor broker resources
3. Increase replica.fetch.max.bytes
4. Increase replica.lag.time.max.ms (temporary fix)
Production Configuration
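# Broker level (server.properties)
replica.lag.time.max.ms=30000
num.replica.fetchers=1
default.replication.factor=3
# Topic level (the replication factor is chosen at topic creation, e.g. --replication-factor 3)
min.insync.replicas=2
unclean.leader.election.enable=false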
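The trade-off behind these values can be worked out with simple arithmetic. The sketch below uses a hypothetical helper to compute how many broker failures a partition tolerates, assuming acks=all producers and no unclean leader election.

```python
def failure_tolerance(replication_factor, min_insync_replicas):
    """Broker failures tolerated per partition, assuming acks=all producers."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return {
        # Writes keep succeeding while at least min.insync.replicas remain in sync.
        "failures_before_writes_stop": replication_factor - min_insync_replicas,
        # Every acknowledged record exists on at least min.insync.replicas brokers.
        "failures_without_losing_acked_data": min_insync_replicas - 1,
    }

print(failure_tolerance(3, 2))
# {'failures_before_writes_stop': 1, 'failures_without_losing_acked_data': 1}
```

With RF=3 and min.insync.replicas=1, the first number rises to 2 but the second drops to 0, which is the durability gap that the higher setting closes.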
Best Practices
✅ ISR monitoring (shrinks/expands)
✅ min.insync.replicas=2 at RF=3
✅ unclean.leader.election.enable=false
✅ Alert on under-replicated partitions
✅ Even replica distribution
✅ Monitor replica lag
❌ Ignoring ISR shrink
❌ min.insync.replicas=1 for production
❌ unclean.leader.election.enable=true
❌ Running without replica lag monitoring
❌ RF < 3 for production
Architectural Decisions
- Leader elected only from ISR: committed data survives failover
- min.insync.replicas=2: balance between availability and durability
- Monitoring ISR shrink: early indicator of problems
- Unclean election risks data loss: avoid in production
Summary for Senior
- ISR is a critical mechanism for Kafka consistency
- A replica enters ISR based on time lag, not message count
- ISR includes the Leader itself — the leader is always in ISR since it is the source of truth and doesn’t need to copy data.
- min.insync.replicas protects against writing to a single copy
- IsrShrinksPerSec is the key monitoring metric
- Unclean leader election is a last-resort measure with data loss risk
🎯 Interview Cheat Sheet
Must know:
- ISR — a list of replicas fully synchronized with the leader (including the leader itself)
- A replica is in ISR if: active and not behind by more than replica.lag.time.max.ms (30s)
- A new leader is selected ONLY from ISR → committed data is not lost
- acks=all waits for acknowledgment from all ISR replicas
- min.insync.replicas=2 at RF=3: write is rejected if ISR < 2
- Shrinking: follower falls behind → removed; Expanding: catches up → returns
- Before Kafka 0.9: message count limit; after: time-based limit (more efficient for high-throughput)
Common follow-up questions:
- Is the leader part of ISR? Yes, the leader is always in ISR: it is the source of truth.
- What happens if ISR is empty? With unclean.leader.election.enable=false the partition is unavailable until an ISR member recovers; with true, any live follower becomes leader (possible data loss).
- Why time-based instead of message-based? A fixed message-count limit is hard to tune: traffic bursts push healthy followers over it and make them flap in and out of ISR.
- What causes ISR shrinking? Network issues, broker overload, slow disk, high data volume.
Red flags (DO NOT say):
- “ISR only includes followers”: the leader is always in ISR
- “min.insync.replicas=1 for production”: no protection against data loss
- “ISR shrinking can be ignored”: it is an early indicator of problems
- “Unclean leader election is safe”: it can lose data that was not replicated
Related topics:
- [[16. What is replication in Kafka]]
- [[17. What are leader and follower replicas]]
- [[19. How does Kafka ensure fault tolerance]]
- [[20. What is producer acknowledgment and what modes exist (acks=0,1,all)]]