What is a partition and why is it needed
🟢 Junior Level
What is a partition?
A partition is a "sub-channel" within a topic. Each partition is an ordered, immutable sequence of messages that is continuously appended to.
Why partitions: they are Kafka's parallelism mechanism. Within a consumer group, each partition is read by at most one consumer. If a topic were a single unit, only one consumer could work at a time. Partitions allow N consumers to read one topic simultaneously.
Analogy
Think of a post office with multiple service windows:
- Topic is the entire post office
- Partitions are individual service windows
- Each window serves its own queue of clients independently
- Clients within one queue are served strictly in order
Topic "orders"
```
┌─────────────┬─────────────┬─────────────┐
│ Partition 0 │ Partition 1 │ Partition 2 │
│   msg-0     │   msg-0     │   msg-0     │
│   msg-1     │   msg-1     │   msg-1     │
│   msg-2     │   msg-2     │   msg-2     │
└─────────────┴─────────────┴─────────────┘
```
Why are partitions needed?
- Parallelism: each partition is read by one consumer (within a group). 10 partitions = 10 consumers working in parallel
- Scalability: partitions are distributed across different brokers
- Fault tolerance: each partition is replicated to multiple brokers
Simple example
```shell
# Create a topic with 3 partitions
kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
```
3 partitions, 3 consumers:
```
Consumer 1 → reads Partition 0
Consumer 2 → reads Partition 1
Consumer 3 → reads Partition 2
```
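The one-partition-per-consumer mapping above can be sketched in a few lines of Python. This is a simplified stand-in for Kafka's assignors (the function name is made up for illustration); it only shows that each partition ends up with exactly one consumer of the group:

```python
def assign_partitions(num_partitions, consumers):
    """Toy partition assignor: spread partitions across a consumer group
    so each partition is read by exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        # Round-robin: partition p goes to exactly one consumer
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign_partitions(3, ["consumer-1", "consumer-2", "consumer-3"]))
# {'consumer-1': [0], 'consumer-2': [1], 'consumer-3': [2]}
```

With more partitions than consumers, some consumers read several partitions; with more consumers than partitions, the extra consumers stay idle.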
🟡 Middle Level
Partition anatomy
Offset: the sequential number of each message within a partition. Unique only within a single partition.
Partition 0:
offset=0 {"userId": 1, "action": "login"}
offset=1 {"userId": 2, "action": "purchase"}
offset=2 {"userId": 1, "action": "logout"}
Ordering: strict ordering is guaranteed only within a single partition. There is no global ordering across a topic.
Immutability: messages cannot be modified or selectively deleted. Deletion happens in whole segments.
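A toy model of these three properties, sequential offsets, append-only writes, no in-place edits. This is illustrative Python, not Kafka code:

```python
class PartitionLog:
    """Append-only sequence of records with sequential offsets."""

    def __init__(self):
        self._records = []

    def append(self, value):
        offset = len(self._records)   # offsets are assigned sequentially
        self._records.append(value)   # records are never modified in place
        return offset

    def read(self, offset):
        return self._records[offset]

log = PartitionLog()
log.append({"userId": 1, "action": "login"})     # returns offset 0
log.append({"userId": 2, "action": "purchase"})  # returns offset 1
```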
Partition replication
Partition 0 on a 3-broker cluster:
Broker A → Leader (handles reads and writes)
Broker B → Follower (copies from the Leader)
Broker C → Follower (copies from the Leader)
- Reads and writes always go through the leader
- Followers passively copy data
- If the leader fails, one of the followers becomes the new leader
ISR (In-Sync Replicas)
ISR = set of replicas that have "caught up" with the leader
Leader: offset=100
Follower B: offset=100 → in ISR
Follower C: offset=95  → NOT in ISR (lagging)
ISR = {Leader, Follower B}
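The ISR check above can be modeled by offset lag. Note this is a deliberate simplification: real Kafka uses a time-based criterion (`replica.lag.time.max.ms`), not a fixed offset gap; the function here is illustrative only:

```python
def compute_isr(leader_leo, follower_leos, max_lag=0):
    """Illustrative ISR check by offset lag. Real Kafka keeps a follower
    in the ISR if it has caught up within replica.lag.time.max.ms;
    here we approximate with a maximum allowed offset gap."""
    isr = ["Leader"]
    for name, leo in follower_leos.items():
        if leader_leo - leo <= max_lag:
            isr.append(name)
    return isr

# Mirrors the example above: B has caught up, C is 5 offsets behind
print(compute_isr(100, {"Follower B": 100, "Follower C": 95}))
# ['Leader', 'Follower B']
```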
Choosing the number of partitions
| Factor | Recommendation |
|---|---|
| Producer throughput | partitions >= producer_throughput / per_partition_throughput |
| Consumer throughput | partitions >= consumer_throughput / per_consumer_throughput |
| Growth margin | Multiply by 2-3 |
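A worked example of the sizing formulas in the table (the throughput numbers are hypothetical, chosen only to illustrate the arithmetic):

```python
import math

def min_partitions(target_throughput, per_partition_throughput, growth_factor=2):
    """partitions >= target / per-partition capacity, then multiplied
    by a growth margin (2-3x per the table above)."""
    base = math.ceil(target_throughput / per_partition_throughput)
    return base * growth_factor

# Hypothetical: 300 MB/s target, ~50 MB/s per partition, 2x growth margin
print(min_partitions(300, 50))  # 12
```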
How data is distributed across partitions
```
partition = hash(key) % numPartitions   // if the record has a key
partition = round-robin / sticky        // if there is no key
```
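Both branches can be sketched in Python. Assumptions to note: Kafka's default partitioner actually uses murmur2 over the key bytes (crc32 stands in here), and modern producers use a "sticky" strategy for keyless records rather than plain round-robin:

```python
import zlib
from itertools import count

def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Keyed records: hash the key (crc32 as a stand-in for Kafka's
    murmur2) so the same key always lands in the same partition."""
    return zlib.crc32(key) % num_partitions

_next = count()

def partition_for_keyless(num_partitions: int) -> int:
    """Keyless records: plain round-robin (the real producer uses a
    'sticky' variant that fills one batch before switching partitions)."""
    return next(_next) % num_partitions

# The same key is always routed to the same partition
assert partition_for_key(b"user-42", 6) == partition_for_key(b"user-42", 6)
```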
Common mistakes
| Mistake | Consequence | Solution |
|---|---|---|
| Single partition | No parallelism | At least 3-6 partitions |
| Too many partitions | Controller load, memory, FD exhaustion | No more than 2000-4000 per broker for stability; the technical limit is ~200,000 (KRaft), but controller performance degrades |
| Adding partitions "on the fly" | Key ordering broken | Plan ahead |
| Uneven distribution | Hot partition | Check key distribution |
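One quick way to check for the "hot partition" mistake is to count how many records each partition would receive for a sample of key hashes (a toy check, not a Kafka API):

```python
from collections import Counter

def partition_load(key_hashes, num_partitions):
    """Count how many records each partition would receive,
    to spot a hot partition caused by skewed keys."""
    return Counter(h % num_partitions for h in key_hashes)

# Skewed keys: every hash is a multiple of 3, so partition 0 takes everything
print(partition_load([0, 3, 6, 9, 12], 3))  # Counter({0: 5})
```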
Adding partitions
kafka-topics.sh --alter --topic orders --partitions 10 --bootstrap-server localhost:9092
Consequences:
- Old data is not redistributed → it stays in its old partitions
- New messages go to all partitions
- `hash(key) % N` changes → the same keys land in different partitions → ordering broken
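The routing break is pure arithmetic. Using an integer as a stand-in for `hash(key)`, the same key maps to a different partition once N changes:

```python
def partition_for(key_hash: int, num_partitions: int) -> int:
    """Partition choice depends on the current partition count."""
    return key_hash % num_partitions

key_hash = 173  # stand-in for hash(key) of some order ID
print(partition_for(key_hash, 50))   # 23  <- before adding partitions
print(partition_for(key_hash, 100))  # 73  <- after: same key, new partition
```

Any consumer that accumulated state for this key in partition 23 will never see its later events, which now arrive in partition 73.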
🔴 Senior Level
Internal partition structure
At the Kafka code level, each partition is represented by a Partition object in ReplicaManager. Physically, a partition is a set of segment files on disk:
```
/var/kafka-logs/orders-0/
├── 00000000000000000000.log        # Segment 0, data
├── 00000000000000000000.index      # Segment 0, sparse index (offset → byte position)
├── 00000000000000000000.timeindex  # Segment 0, time index (timestamp → offset)
├── 00000000000000000000.txnindex   # Segment 0, aborted-transaction index (if transactions are used)
├── 00000000000000100000.log        # Segment 1 (rotated at segment.bytes)
├── 00000000000000100000.index
├── 00000000000000100000.timeindex
└── leader-epoch-checkpoint         # Prevents data loss on leader change
```
Log Segments β details
```
segment.bytes=1073741824     # Rotation by size (1 GB)
segment.ms=604800000         # Rotation by time (7 days)
segment.jitter.ms=0          # Random delay to avoid a thundering herd
index.interval.bytes=4096    # How often an index entry is written
```
Sparse index: an entry is written not for every message but once per `index.interval.bytes` bytes of data. This keeps the index compact (~1 entry per 4 KB of data). Lookup: binary search in the index → short linear scan within the segment.
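The sparse-index lookup can be demonstrated in a few lines. This is a schematic model (function names are made up; the real broker indexes by relative offsets within a segment), but the search strategy, binary search then linear scan, is the same:

```python
import bisect

def build_sparse_index(record_sizes, interval_bytes=4096):
    """One (offset, byte position) entry per ~interval_bytes of data,
    mimicking index.interval.bytes."""
    index, pos, last_indexed = [], 0, None
    for offset, size in enumerate(record_sizes):
        if last_indexed is None or pos - last_indexed >= interval_bytes:
            index.append((offset, pos))
            last_indexed = pos
        pos += size
    return index

def locate(index, target_offset):
    """Binary search for the nearest index entry at or before target_offset;
    the broker then scans the segment linearly from that byte position."""
    offsets = [o for o, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[max(i, 0)]

# 10 records of 1000 bytes each -> index entries only at offsets 0 and 5
idx = build_sparse_index([1000] * 10)
print(idx)             # [(0, 0), (5, 5000)]
print(locate(idx, 7))  # (5, 5000): linear scan starts at byte 5000
```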
Leader/Follower Replication Protocol
Replica Fetcher Thread (on each Follower):
1. Send FetchRequest → Leader
2. Leader returns batch with current leader epoch
3. Follower writes to its log
4. Updates LEO (Log End Offset)
5. Leader updates HW (High Watermark)
High Watermark (HW) = the smallest LEO across the ISR: everything below it is confirmed by all in-sync replicas
Messages with offset >= HW → not visible to consumers
LEO (Log End Offset): the offset of the next message to be written (one past the last record in the log).
HW (High Watermark): the offset up to which all ISR replicas have the data. Consumers only see messages below the HW.
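A minimal model of the HW/LEO relationship, assuming the definitions above (HW = minimum LEO over the ISR; these helper names are invented for illustration):

```python
def high_watermark(isr_leos):
    """HW = the smallest log-end offset among the in-sync replicas:
    everything below it is fully replicated."""
    return min(isr_leos.values())

def visible_to_consumers(hw):
    """Consumers may read offsets 0 .. HW-1."""
    return list(range(hw))

# Only ISR members count toward the HW
leos = {"leader": 100, "follower-b": 97}
print(high_watermark(leos))  # 97 -> consumers see offsets 0..96
```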
Leader Epoch: Introduced in Kafka 0.11 to eliminate data loss on leader change. Stored in leader-epoch-checkpoint file. On failover, the new leader truncates its log to the last confirmed epoch.
ISR (In-Sync Replicas) β deep dive
Conditions for being in ISR:
replica.lag.time.max.ms=30000 // Follower must "catch up" within 30 seconds
replica.fetch.wait.max.ms=500 // Max wait for fetch request
Replica State Machine:
NewReplica → OnlineReplica → Leader/Follower → OfflineReplica → NonExistentReplica
OSR (Out-of-Sync Replicas): Replicas lagging behind the leader. Not counted towards min.insync.replicas.
KRaft Mode (Kafka Raft Metadata)
Before KRaft: ZooKeeper stored cluster metadata (topics, partitions, controller, ISR).
With KRaft (Kafka 3.3+ production-ready):
Controller Quorum (odd number, min 3 nodes):
- Stores metadata in __cluster_metadata topic
- Raft consensus protocol instead of ZK
- Eliminates ZK as SPOF
- Faster metadata change processing
- Scales to far more partitions than ZooKeeper mode (which degraded beyond ~200,000 partitions cluster-wide)
metadata.log.segment.bytes=100MB
metadata.log.segment.ms=1 day
metadata.max.retention.bytes=100MB
Edge Cases
- Unclean Leader Election: with `unclean.leader.election.enable=true`, an out-of-sync replica can become leader → data loss (messages between the old leader's HW and the new leader's LEO are lost). For financial systems, always set it to `false`.
- Split-Brain with KRaft: if the controller quorum loses its majority (e.g., 3 out of 5 nodes are down), the cluster can no longer change metadata (create topics, elect leaders). Reads and writes to existing partitions keep working.
- Log Divergence on Network Partition: if the leader and a follower are separated by a network partition, the leader keeps accepting writes. After the network recovers, the follower truncates its log to the leader's HW. If the HW had not yet advanced, messages already confirmed to the producer may be lost (with `acks=1`).
- Metadata Propagation Delay: in large clusters (10K+ partitions), metadata updates (new topic, new partition) can take 30-60 seconds. Producers/consumers holding stale metadata send requests to the wrong brokers (`NotLeaderForPartitionException`).
- Disk Failure on a Single Replica: if a disk holding a partition replica fails, the ISR shrinks. If `min.insync.replicas` can no longer be satisfied, producers with `acks=all` receive `NotEnoughReplicasException`. Kafka does not automatically rebuild the lost replica's data from other brokers; manual intervention is required.
Performance Numbers
| Metric | Value | Conditions |
|---|---|---|
| Throughput per partition | ~50-100 MB/s | SSD, batch.size=256KB, lz4 |
| Max partitions/broker | ~200,000 | KRaft, 64GB RAM, SSD |
| Recommended partitions/broker | 2,000-4,000 | For stable operation |
| Replication lag (normal) | < 100 ms | Within a single AZ |
| Leader election time | 5-30 seconds | Depends on unclean.leader.election |
| Segment flush | Left to the OS page cache | log.flush.interval.messages=9223372036854775807 (default) |
Production War Story
Situation: e-commerce platform with 50 partitions on topic `orders`. After adding 50 more partitions (100 total) for scaling, clients started complaining about "lost" orders and duplicate processing.

Root cause: when the partition count grew, `hash(key) % 50` became `hash(key) % 100`. Events for the same order landed in different partitions: the order processing service collected events by key, but now "order created" was in partition 23, while "payment" was in partition 73. The payment handler couldn't find the original order.

Additional problem: 100 partitions × 3 replicas = 300 partition replicas spread across 5 brokers, each holding several files per segment (.log, .index, .timeindex). Open file descriptor usage for this single topic jumped, and `Too many open files` errors started occurring.

Solution:
1. Created a new topic `orders-v2` with 100 partitions
2. Dual-write: producers sent messages to both the old and the new topic for 72 hours, so consumers could finish draining the old topic before switching to the new one
3. Consumers switched to `orders-v2`
4. Increased `ulimit -n` from 4096 to 65536
5. Set `log.retention.check.interval.ms=300000` for timely segment cleanup

Lesson: never add partitions to a topic whose keys carry ordering guarantees. Create a new topic and migrate traffic.
Monitoring (JMX + Burrow)
Partition JMX metrics:
kafka.cluster:type=Partition,name=UnderReplicated
kafka.cluster:type=Partition,name=InSyncReplicasCount
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka.log:type=Log,name=LogEndOffset,partition=0
kafka.log:type=Log,name=LogStartOffset,partition=0
kafka.log:type=Log,name=Size,partition=0
Burrow (consumer-centric):
- Lag per partition: granularity down to the partition level
- Status: OK, WARN, ERR, STOP, STALL
- HTTP API → Grafana → PagerDuty
Highload Best Practices
- Plan partitions ahead: they cannot be reduced without recreating the topic
- Formula: `partitions = max(prod_throughput, cons_throughput) / per_partition_capacity * 1.5`
- No more than 4000 partitions per broker, otherwise controller degradation
- Use KRaft for new clusters: faster metadata propagation, no ZK dependency
- Monitor ISR Shrink/Expand: an indicator of replication problems
- Segment sizing: set `segment.bytes` so a segment closes in 1-4 hours
- OS tuning: `vm.dirty_background_ratio=5`, `vm.dirty_ratio=10` for flush control
- Disk: SSD is mandatory for production; NVMe for high throughput (>50 MB/s per partition)
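The sizing formula above, expressed as code (the throughput figures in the example are hypothetical):

```python
import math

def plan_partitions(prod_throughput, cons_throughput, per_partition_capacity):
    """partitions = max(prod, cons) / per-partition capacity * 1.5,
    rounded up; the 1.5x is the growth margin from the rule above."""
    needed = max(prod_throughput, cons_throughput) / per_partition_capacity
    return math.ceil(needed * 1.5)

# Hypothetical: 200 MB/s produce, 120 MB/s consume, ~50 MB/s per partition
print(plan_partitions(200, 120, 50))  # 6
```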
🎯 Interview Cheat Sheet
Must know:
- Partition: the parallelism mechanism; 1 partition = 1 consumer within a group
- Ordering is strictly guaranteed only within a single partition (FIFO)
- Offset: sequential message number, unique only within a partition
- Each partition is replicated: Leader (read/write) + Followers (copy)
- ISR: replicas synchronized with the leader; the leader is chosen only from the ISR
- `hash(key) % N`: when partitions are added, key distribution changes
- Reducing partitions is impossible; adding partitions breaks key ordering
- Recommended maximum: 2000-4000 partitions per broker
Common follow-up questions:
- What happens when you add partitions? → Old data is not redistributed; keys land in different partitions → ordering broken.
- What is the High Watermark? → The last offset confirmed by all ISR replicas. Consumers only see up to the HW.
- Can you reduce partitions? → No, only by deleting and recreating the topic.
- What is a Leader Epoch? → The leader generation number; it prevents data loss on leader change.
Red flags (DO NOT say):
- "Ordering is guaranteed across the whole topic" → only within a partition
- "You can reduce partitions" → impossible
- "Offset is globally unique" → unique only within a partition
- "Adding partitions is safe when using keys" → it breaks ordering
Related topics:
- [[1. What is topic in Kafka]]
- [[3. How is data distributed across partitions]]
- [[17. What are leader and follower replicas]]
- [[18. What is ISR (In-Sync Replicas)]]
- [[28. How are old messages deleted from a topic]]