What is a partition and why is it needed
🟢 Junior Level
What is a partition?
A partition is a "sub-channel" within a topic. Each partition is an ordered, immutable sequence of messages that is continuously appended to.
Why partitions: they are Kafka's parallelism mechanism. Within a consumer group, each partition is read by at most one consumer. If a topic were a single unit, only one consumer could work at a time. Partitions allow N consumers to read one topic simultaneously.
Analogy
Think of a post office with multiple service windows:
- Topic is the entire post office
- Partitions are individual service windows
- Each window serves its own queue of clients independently
- Clients within one queue are served strictly in order
Topic "orders"
```
┌─────────────┬─────────────┬─────────────┐
│ Partition 0 │ Partition 1 │ Partition 2 │
│   msg-0     │   msg-0     │   msg-0     │
│   msg-1     │   msg-1     │   msg-1     │
│   msg-2     │   msg-2     │   msg-2     │
└─────────────┴─────────────┴─────────────┘
```
Why are partitions needed?
- Parallelism: each partition is read by one consumer (within a group). 10 partitions = 10 consumers working in parallel
- Scalability: partitions are distributed across different brokers
- Fault tolerance: each partition is replicated to multiple brokers
Simple example
```shell
# Create a topic with 3 partitions
kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
```
3 partitions, 3 consumers:
```
Consumer 1 → reads Partition 0
Consumer 2 → reads Partition 1
Consumer 3 → reads Partition 2
```
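The one-partition-per-consumer mapping above can be sketched in a few lines of Python. This is a simplified stand-in for Kafka's assignors (the function name is made up for illustration); it only shows that each partition ends up with exactly one consumer of the group:

```python
def assign_partitions(num_partitions, consumers):
    """Toy partition assignor: spread partitions across a consumer group
    so each partition is read by exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        # Round-robin: partition p goes to exactly one consumer
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign_partitions(3, ["consumer-1", "consumer-2", "consumer-3"]))
# {'consumer-1': [0], 'consumer-2': [1], 'consumer-3': [2]}
```

With more partitions than consumers, some consumers read several partitions; with more consumers than partitions, the extra consumers stay idle.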
🟡 Middle Level
Partition anatomy
Offset: the sequential number of each message within a partition. Unique only within a single partition.
Partition 0:
offset=0 {"userId": 1, "action": "login"}
offset=1 {"userId": 2, "action": "purchase"}
offset=2 {"userId": 1, "action": "logout"}
Ordering: strict ordering is guaranteed only within a single partition. There is no global ordering across a topic.
Immutability: messages cannot be modified or selectively deleted. Deletion happens in whole segments.
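A toy model of these three properties, sequential offsets, append-only writes, no in-place edits. This is illustrative Python, not Kafka code:

```python
class PartitionLog:
    """Append-only sequence of records with sequential offsets."""

    def __init__(self):
        self._records = []

    def append(self, value):
        offset = len(self._records)   # offsets are assigned sequentially
        self._records.append(value)   # records are never modified in place
        return offset

    def read(self, offset):
        return self._records[offset]

log = PartitionLog()
log.append({"userId": 1, "action": "login"})     # returns offset 0
log.append({"userId": 2, "action": "purchase"})  # returns offset 1
```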
Partition replication
Partition 0 on a 3-broker cluster:
Broker A → Leader (handles reads and writes)
Broker B → Follower (copies from the Leader)
Broker C → Follower (copies from the Leader)
- Reads and writes always go through the leader
- Followers passively copy data
- If the leader fails, one of the followers becomes the new leader
ISR (In-Sync Replicas)
ISR = set of replicas that have "caught up" with the leader
Leader: offset=100
Follower B: offset=100 → in ISR
Follower C: offset=95  → NOT in ISR (lagging)
ISR = {Leader, Follower B}
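The ISR check above can be modeled by offset lag. Note this is a deliberate simplification: real Kafka uses a time-based criterion (`replica.lag.time.max.ms`), not a fixed offset gap; the function here is illustrative only:

```python
def compute_isr(leader_leo, follower_leos, max_lag=0):
    """Illustrative ISR check by offset lag. Real Kafka keeps a follower
    in the ISR if it has caught up within replica.lag.time.max.ms;
    here we approximate with a maximum allowed offset gap."""
    isr = ["Leader"]
    for name, leo in follower_leos.items():
        if leader_leo - leo <= max_lag:
            isr.append(name)
    return isr

# Mirrors the example above: B has caught up, C is 5 offsets behind
print(compute_isr(100, {"Follower B": 100, "Follower C": 95}))
# ['Leader', 'Follower B']
```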
Choosing the number of partitions
| Factor | Recommendation |
|---|---|
| Producer throughput | partitions >= producer_throughput / per_partition_throughput |
| Consumer throughput | partitions >= consumer_throughput / per_consumer_throughput |
| Growth margin | Multiply by 2-3 |
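A worked example of the sizing formulas in the table (the throughput numbers are hypothetical, chosen only to illustrate the arithmetic):

```python
import math

def min_partitions(target_throughput, per_partition_throughput, growth_factor=2):
    """partitions >= target / per-partition capacity, then multiplied
    by a growth margin (2-3x per the table above)."""
    base = math.ceil(target_throughput / per_partition_throughput)
    return base * growth_factor

# Hypothetical: 300 MB/s target, ~50 MB/s per partition, 2x growth margin
print(min_partitions(300, 50))  # 12
```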
How data is distributed across partitions
```
partition = hash(key) % numPartitions   // if the record has a key
partition = round-robin / sticky        // if there is no key
```
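Both branches can be sketched in Python. Assumptions to note: Kafka's default partitioner actually uses murmur2 over the key bytes (crc32 stands in here), and modern producers use a "sticky" strategy for keyless records rather than plain round-robin:

```python
import zlib
from itertools import count

def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Keyed records: hash the key (crc32 as a stand-in for Kafka's
    murmur2) so the same key always lands in the same partition."""
    return zlib.crc32(key) % num_partitions

_next = count()

def partition_for_keyless(num_partitions: int) -> int:
    """Keyless records: plain round-robin (the real producer uses a
    'sticky' variant that fills one batch before switching partitions)."""
    return next(_next) % num_partitions

# The same key is always routed to the same partition
assert partition_for_key(b"user-42", 6) == partition_for_key(b"user-42", 6)
```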
Common mistakes
| Mistake | Consequence | Solution |
|---|---|---|
| Single partition | No parallelism | At least 3-6 partitions |
| Too many partitions | Controller load, memory, FD exhaustion | No more than 2000-4000 per broker for stability; the technical limit is ~200,000 (KRaft), but controller performance degrades |
| Adding partitions "on the fly" | Key ordering broken | Plan ahead |
| Uneven distribution | Hot partition | Check key distribution |
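One quick way to check for the "hot partition" mistake is to count how many records each partition would receive for a sample of key hashes (a toy check, not a Kafka API):

```python
from collections import Counter

def partition_load(key_hashes, num_partitions):
    """Count how many records each partition would receive,
    to spot a hot partition caused by skewed keys."""
    return Counter(h % num_partitions for h in key_hashes)

# Skewed keys: every hash is a multiple of 3, so partition 0 takes everything
print(partition_load([0, 3, 6, 9, 12], 3))  # Counter({0: 5})
```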
Adding partitions
kafka-topics.sh --alter --topic orders --partitions 10 --bootstrap-server localhost:9092
Consequences:
- Old data is not redistributed → it stays in its old partitions
- New messages go to all partitions
- `hash(key) % N` changes → the same keys land in different partitions → ordering broken
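The routing break is pure arithmetic. Using an integer as a stand-in for `hash(key)`, the same key maps to a different partition once N changes:

```python
def partition_for(key_hash: int, num_partitions: int) -> int:
    """Partition choice depends on the current partition count."""
    return key_hash % num_partitions

key_hash = 173  # stand-in for hash(key) of some order ID
print(partition_for(key_hash, 50))   # 23  <- before adding partitions
print(partition_for(key_hash, 100))  # 73  <- after: same key, new partition
```

Any consumer that accumulated state for this key in partition 23 will never see its later events, which now arrive in partition 73.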
🔴 Senior Level
Internal partition structure
At the Kafka code level, each partition is represented by a Partition object in ReplicaManager. Physically, a partition is a set of segment files on disk:
```
/var/kafka-logs/orders-0/
├── 00000000000000000000.log        # Segment 0, data
├── 00000000000000000000.index      # Segment 0, sparse index (offset → byte position)
├── 00000000000000000000.timeindex  # Segment 0, time index (timestamp → offset)
├── 00000000000000000000.txnindex   # Segment 0, aborted-transaction index (if transactions are used)
├── 00000000000000100000.log        # Segment 1 (rotated at segment.bytes)
├── 00000000000000100000.index
├── 00000000000000100000.timeindex
└── leader-epoch-checkpoint         # Prevents data loss on leader change
```
Log Segments β details
```
segment.bytes=1073741824     # Rotation by size (1 GB)
segment.ms=604800000         # Rotation by time (7 days)
segment.jitter.ms=0          # Random delay to avoid a thundering herd
index.interval.bytes=4096    # How often an index entry is written
```
Sparse index: an entry is written not for every message but once per `index.interval.bytes` bytes of data. This keeps the index compact (~1 entry per 4 KB of data). Lookup: binary search in the index → short linear scan within the segment.
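The sparse-index lookup can be demonstrated in a few lines. This is a schematic model (function names are made up; the real broker indexes by relative offsets within a segment), but the search strategy, binary search then linear scan, is the same:

```python
import bisect

def build_sparse_index(record_sizes, interval_bytes=4096):
    """One (offset, byte position) entry per ~interval_bytes of data,
    mimicking index.interval.bytes."""
    index, pos, last_indexed = [], 0, None
    for offset, size in enumerate(record_sizes):
        if last_indexed is None or pos - last_indexed >= interval_bytes:
            index.append((offset, pos))
            last_indexed = pos
        pos += size
    return index

def locate(index, target_offset):
    """Binary search for the nearest index entry at or before target_offset;
    the broker then scans the segment linearly from that byte position."""
    offsets = [o for o, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[max(i, 0)]

# 10 records of 1000 bytes each -> index entries only at offsets 0 and 5
idx = build_sparse_index([1000] * 10)
print(idx)             # [(0, 0), (5, 5000)]
print(locate(idx, 7))  # (5, 5000): linear scan starts at byte 5000
```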
Leader/Follower Replication Protocol
Replica Fetcher Thread (on each Follower):
1. Send FetchRequest → Leader
2. Leader returns batch with current leader epoch
3. Follower writes to its log
4. Updates LEO (Log End Offset)
5. Leader updates HW (High Watermark)
High Watermark (HW) = the smallest LEO across the ISR: everything below it is confirmed by all in-sync replicas
Messages with offset >= HW → not visible to consumers
LEO (Log End Offset): the offset of the next message to be written (one past the last record in the log).
HW (High Watermark): the offset up to which all ISR replicas have the data. Consumers only see messages below the HW.
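A minimal model of the HW/LEO relationship, assuming the definitions above (HW = minimum LEO over the ISR; these helper names are invented for illustration):

```python
def high_watermark(isr_leos):
    """HW = the smallest log-end offset among the in-sync replicas:
    everything below it is fully replicated."""
    return min(isr_leos.values())

def visible_to_consumers(hw):
    """Consumers may read offsets 0 .. HW-1."""
    return list(range(hw))

# Only ISR members count toward the HW
leos = {"leader": 100, "follower-b": 97}
print(high_watermark(leos))  # 97 -> consumers see offsets 0..96
```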
Leader Epoch: Introduced in Kafka 0.11 to eliminate data loss on leader change. Stored in leader-epoch-checkpoint file. On failover, the new leader truncates its log to the last confirmed epoch.
ISR (In-Sync Replicas) β deep dive
Conditions for being in ISR:
replica.lag.time.max.ms=30000 // Follower must "catch up" within 30 seconds
replica.fetch.wait.max.ms=500 // Max wait for fetch request
Replica State Machine:
NewReplica → OnlineReplica → Leader/Follower → OfflineReplica → NonExistentReplica
OSR (Out-of-Sync Replicas): Replicas lagging behind the leader. Not counted towards min.insync.replicas.
KRaft Mode (Kafka Raft Metadata)
Before KRaft: ZooKeeper stored cluster metadata (topics, partitions, controller, ISR).
With KRaft (Kafka 3.3+ production-ready):
Controller Quorum (odd number, min 3 nodes):
- Stores metadata in __cluster_metadata topic
- Raft consensus protocol instead of ZK
- Eliminates ZK as SPOF
- Faster metadata change processing
- Scales to far more partitions than ZooKeeper mode (which degraded beyond ~200,000 partitions cluster-wide)
metadata.log.segment.bytes=100MB
metadata.log.segment.ms=1 day
metadata.max.retention.bytes=100MB
Edge Cases
- Unclean Leader Election: with `unclean.leader.election.enable=true`, an out-of-sync replica can become leader → data loss (messages between the old leader's HW and the new leader's LEO are lost). For financial systems, always set it to `false`.
- Split-Brain with KRaft: if the controller quorum loses its majority (e.g., 3 out of 5 nodes are down), the cluster can no longer change metadata (create topics, elect leaders). Reads and writes to existing partitions keep working.
- Log Divergence on Network Partition: if the leader and a follower are separated by a network partition, the leader keeps accepting writes. After the network recovers, the follower truncates its log to the leader's HW. If the HW had not yet advanced, messages already confirmed to the producer may be lost (with `acks=1`).
- Metadata Propagation Delay: in large clusters (10K+ partitions), metadata updates (new topic, new partition) can take 30-60 seconds. Producers/consumers holding stale metadata send requests to the wrong brokers (`NotLeaderForPartitionException`).
- Disk Failure on a Single Replica: if a disk holding a partition replica fails, the ISR shrinks. If `min.insync.replicas` can no longer be satisfied, producers with `acks=all` receive `NotEnoughReplicasException`. Kafka does not automatically rebuild the lost replica's data from other brokers; manual intervention is required.
Performance Numbers
| Metric | Value | Conditions |
|---|---|---|
| Throughput per partition | ~50-100 MB/s | SSD, batch.size=256KB, lz4 |
| Max partitions/broker | ~200,000 | KRaft, 64GB RAM, SSD |
| Recommended partitions/broker | 2,000-4,000 | For stable operation |
| Replication lag (normal) | < 100 ms | Within a single AZ |
| Leader election time | 5-30 seconds | Depends on unclean.leader.election |
| Segment flush | Left to the OS page cache | log.flush.interval.messages=9223372036854775807 (default) |
Production War Story
Situation: e-commerce platform with 50 partitions on topic `orders`. After adding 50 more partitions (100 total) for scaling, clients started complaining about "lost" orders and duplicate processing.

Root cause: when the partition count grew, `hash(key) % 50` became `hash(key) % 100`. Events for the same order landed in different partitions: the order processing service collected events by key, but now "order created" was in partition 23, while "payment" was in partition 73. The payment handler couldn't find the original order.

Additional problem: 100 partitions × 3 replicas = 300 partition replicas spread across 5 brokers, each holding several files per segment (.log, .index, .timeindex). Open file descriptor usage for this single topic jumped, and `Too many open files` errors started occurring.

Solution:
1. Created a new topic `orders-v2` with 100 partitions
2. Dual-write: producers sent messages to both the old and the new topic for 72 hours, so consumers could finish draining the old topic before switching to the new one
3. Consumers switched to `orders-v2`
4. Increased `ulimit -n` from 4096 to 65536
5. Set `log.retention.check.interval.ms=300000` for timely segment cleanup

Lesson: never add partitions to a topic whose keys carry ordering guarantees. Create a new topic and migrate traffic.
Monitoring (JMX + Burrow)
Partition JMX metrics:
kafka.cluster:type=Partition,name=UnderReplicated
kafka.cluster:type=Partition,name=InSyncReplicasCount
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka.log:type=Log,name=LogEndOffset,partition=0
kafka.log:type=Log,name=LogStartOffset,partition=0
kafka.log:type=Log,name=Size,partition=0
Burrow (consumer-centric):
- Lag per partition: granularity down to the partition level
- Status: OK, WARN, ERR, STOP, STALL
- HTTP API → Grafana → PagerDuty
Highload Best Practices
- Plan partitions ahead: they cannot be reduced without recreating the topic
- Formula: `partitions = max(prod_throughput, cons_throughput) / per_partition_capacity * 1.5`
- No more than 4000 partitions per broker, otherwise controller degradation
- Use KRaft for new clusters: faster metadata propagation, no ZK dependency
- Monitor ISR Shrink/Expand: an indicator of replication problems
- Segment sizing: set `segment.bytes` so a segment closes in 1-4 hours
- OS tuning: `vm.dirty_background_ratio=5`, `vm.dirty_ratio=10` for flush control
- Disk: SSD is mandatory for production; NVMe for high throughput (>50 MB/s per partition)
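The sizing formula above, expressed as code (the throughput figures in the example are hypothetical):

```python
import math

def plan_partitions(prod_throughput, cons_throughput, per_partition_capacity):
    """partitions = max(prod, cons) / per-partition capacity * 1.5,
    rounded up; the 1.5x is the growth margin from the rule above."""
    needed = max(prod_throughput, cons_throughput) / per_partition_capacity
    return math.ceil(needed * 1.5)

# Hypothetical: 200 MB/s produce, 120 MB/s consume, ~50 MB/s per partition
print(plan_partitions(200, 120, 50))  # 6
```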
🎯 Interview Cheat Sheet
Must know:
- Partition: the parallelism mechanism; 1 partition = 1 consumer within a group
- Ordering is strictly guaranteed only within a single partition (FIFO)
- Offset: sequential message number, unique only within a partition
- Each partition is replicated: Leader (read/write) + Followers (copy)
- ISR: replicas synchronized with the leader; the leader is chosen only from the ISR
- `hash(key) % N`: when partitions are added, key distribution changes
- Reducing partitions is impossible; adding partitions breaks key ordering
- Recommended maximum: 2000-4000 partitions per broker
Common follow-up questions:
- What happens when you add partitions? → Old data is not redistributed; keys land in different partitions → ordering broken.
- What is the High Watermark? → The last offset confirmed by all ISR replicas. Consumers only see up to the HW.
- Can you reduce partitions? → No, only by deleting and recreating the topic.
- What is a Leader Epoch? → The leader generation number; it prevents data loss on leader change.
Red flags (DO NOT say):
- "Ordering is guaranteed across the whole topic" → only within a partition
- "You can reduce partitions" → impossible
- "Offset is globally unique" → unique only within a partition
- "Adding partitions is safe when using keys" → it breaks ordering
Related topics:
- [[1. What is topic in Kafka]]
- [[3. How is data distributed across partitions]]
- [[17. What are leader and follower replicas]]
- [[18. What is ISR (In-Sync Replicas)]]
- [[28. How are old messages deleted from a topic]]