Question 2 · Section 15

What is a partition and why is it needed?

🟢 Junior Level

What is a partition?

A partition is a “sub-channel” within a topic. Each partition is an ordered, immutable sequence of messages that is continuously appended to.

Why partitions: they are the parallelism mechanism. Within a consumer group, each partition is read by exactly one consumer. If a topic were a single indivisible unit, only one consumer could work on it at a time. Partitions allow N consumers to read one topic simultaneously.

Analogy

Think of a post office with multiple service windows:

  • Topic is the entire post office
  • Partitions are individual service windows
  • Each window serves its own queue of clients independently
  • Clients within one queue are served strictly in order
Topic "orders"
┌─────────────┬─────────────┬─────────────┐
│Partition 0  │Partition 1  │Partition 2  │
│ msg-0       │ msg-0       │ msg-0       │
│ msg-1       │ msg-1       │ msg-1       │
│ msg-2       │ msg-2       │ msg-2       │
└─────────────┴─────────────┴─────────────┘
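The diagram above can be modeled as a toy in-memory sketch. The classes below are hypothetical, for illustration only: real Kafka persists each partition as an on-disk log.

```python
# Toy model of a topic with ordered, append-only partitions.
# Illustrative sketch only -- not Kafka's actual implementation.

class Partition:
    def __init__(self):
        self.log = []  # append-only message list

    def append(self, message):
        offset = len(self.log)  # next sequential offset
        self.log.append(message)
        return offset

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

topic = Topic(num_partitions=3)
print(topic.partitions[0].append("msg-0"))  # offset 0
print(topic.partitions[0].append("msg-1"))  # offset 1
print(topic.partitions[1].append("msg-0"))  # offset 0 (offsets are per partition)
```

Note how each partition numbers its messages independently: offsets restart from 0 in every partition.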

Why are partitions needed?

  1. Parallelism — each partition is read by at most one consumer in a group. 10 partitions = up to 10 consumers working in parallel
  2. Scalability — partitions are distributed across different brokers
  3. Fault tolerance — each partition is replicated to multiple brokers

Simple example

# Create topic with 3 partitions
kafka-topics.sh --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
3 partitions, 3 consumers:
  Consumer 1 → reads Partition 0
  Consumer 2 → reads Partition 1
  Consumer 3 → reads Partition 2
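The 1:1 mapping above can be sketched with a simple round-robin assignor. This is a simplification: Kafka's real assignors (range, cooperative-sticky) are more involved.

```python
# Toy round-robin partition assignment within one consumer group.
# Simplified sketch -- not Kafka's actual assignor implementation.

def assign_partitions(num_partitions, consumers):
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign_partitions(3, ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': [2]} -- one partition each

# With more consumers than partitions, the extras sit idle:
print(assign_partitions(3, ["c1", "c2", "c3", "c4"]))
```

The second call shows why adding consumers beyond the partition count gains nothing: `c4` receives no partitions.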

🟡 Middle Level

Partition anatomy

Offset — the sequential number of each message within a partition. Unique only within a single partition.

Partition 0:
  offset=0  {"userId": 1, "action": "login"}
  offset=1  {"userId": 2, "action": "purchase"}
  offset=2  {"userId": 1, "action": "logout"}

Ordering — strict ordering only within a single partition. There is no global ordering across a topic.

Immutability — messages cannot be modified or selectively deleted. Deletion happens in whole segments.

Partition replication

Partition 0 on a 3-broker cluster:
  Broker A → Leader (handles read/write)
  Broker B → Follower (copies from Leader)
  Broker C → Follower (copies from Leader)
  • Reads and writes always go through the leader
  • Followers passively copy data
  • If the leader fails, one of the followers becomes the new leader

ISR (In-Sync Replicas)

ISR = set of replicas that have "caught up" with the leader

Leader: offset=100
Follower B: offset=100  → in ISR
Follower C: offset=95   → NOT in ISR (lagging)

ISR = {Leader, Follower B}
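In modern Kafka, ISR membership is actually judged by time rather than raw offset gap: a follower stays in the ISR while it has fully caught up to the leader within replica.lag.time.max.ms. A minimal sketch of that rule (function and variable names are illustrative):

```python
# Sketch of the time-based ISR rule: a follower remains in sync if it
# last caught up to the leader's log end within replica.lag.time.max.ms.

REPLICA_LAG_TIME_MAX_MS = 30_000  # broker default

def in_isr(last_caught_up_ms, now_ms):
    """Follower is in ISR if it fully caught up recently enough."""
    return (now_ms - last_caught_up_ms) <= REPLICA_LAG_TIME_MAX_MS

now = 1_000_000
assert in_isr(now - 5_000, now)       # caught up 5 s ago -> stays in ISR
assert not in_isr(now - 60_000, now)  # lagging 60 s -> dropped from ISR
```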

Choosing the number of partitions

| Factor | Recommendation |
| --- | --- |
| Producer throughput | partitions >= producer_throughput / per_partition_throughput |
| Consumer throughput | partitions >= consumer_throughput / per_consumer_throughput |
| Growth margin | Multiply by 2-3 |
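The recommendations above combine into a simple sizing helper. This is a sketch; the throughput figures in the example are hypothetical.

```python
import math

# Sketch of the sizing heuristic: take the larger of the producer-side
# and consumer-side requirements, then multiply by a growth margin.
# All throughputs in MB/s; the example numbers are made up.

def choose_partitions(target_mbps, per_partition_mbps,
                      per_consumer_mbps, growth_factor=2):
    by_producer = math.ceil(target_mbps / per_partition_mbps)
    by_consumer = math.ceil(target_mbps / per_consumer_mbps)
    return max(by_producer, by_consumer) * growth_factor

# Need 300 MB/s; one partition sustains ~50 MB/s, one consumer ~30 MB/s:
print(choose_partitions(300, 50, 30))  # ceil(300/30) = 10, x2 margin -> 20
```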

How data is distributed across partitions

partition = hash(key) % numPartitions    // if key exists
partition = round-robin / sticky         // if no key
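That routing rule can be sketched as follows. Python's built-in hash stands in for Kafka's murmur2, and the sticky choice is pinned to one value here for simplicity.

```python
# Sketch of the producer's routing rule. The real producer hashes the
# key bytes with murmur2; hash() is a stand-in for illustration.

def pick_partition(key, num_partitions, sticky_partition=0):
    if key is not None:
        return hash(key) % num_partitions  # keyed: deterministic routing
    return sticky_partition                # keyless: sticky batch target

# The same key always lands on the same partition:
assert pick_partition("order-42", 6) == pick_partition("order-42", 6)
```

This determinism is exactly what gives per-key ordering: all events for one key go through one partition.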

Common mistakes

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Single partition | No parallelism | At least 3-6 partitions |
| Too many partitions | Controller load, memory, FD exhaustion | Keep to 2,000-4,000 per broker for stability; the technical limit is ~200,000 (KRaft), but controller performance degrades |
| Adding partitions “on the fly” | Key ordering broken | Plan ahead |
| Uneven distribution | Hot partition | Check key distribution |

Adding partitions

kafka-topics.sh --alter --topic orders \
  --partitions 10 \
  --bootstrap-server localhost:9092

Consequences:

  1. Old data is not redistributed — it stays in its old partitions
  2. New messages go to all partitions
  3. hash(key) % N changes → the same keys land in different partitions → ordering broken
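Consequence 3 is easy to demonstrate. The sketch below uses md5 as a stand-in deterministic hash (Kafka itself uses murmur2) to count how many keys change partition when the count grows from 3 to 10.

```python
import hashlib

# Demonstration: changing the partition count re-routes most keys.
# md5 is a stand-in deterministic hash; Kafka uses murmur2.

def route(key: str, num_partitions: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

moved = sum(1 for i in range(1000)
            if route(f"key-{i}", 3) != route(f"key-{i}", 10))
print(f"{moved}/1000 keys changed partition after 3 -> 10")
```

Roughly 90% of keys end up on a different partition, so any per-key ordering that spans the resize is broken.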

🔴 Senior Level

Internal partition structure

At the Kafka code level, each partition is represented by a Partition object in ReplicaManager. Physically, a partition is a set of segment files on disk:

/var/kafka-logs/orders-0/
├── 00000000000000000000.log          # Segment 0, data
├── 00000000000000000000.index        # Segment 0, sparse index (offset → physical position)
├── 00000000000000000000.timeindex    # Segment 0, time-based index (timestamp → offset)
├── 00000000000000000000.txnindex     # Segment 0, transaction index
├── 00000000000000100000.log          # Segment 1 (rotated at segment.bytes)
├── 00000000000000100000.index
├── 00000000000000100000.timeindex
├── leader-epoch-checkpoint           # For preventing data loss on leader change
└── partition.metadata                # Topic ID metadata (Kafka 2.8+)

Log Segments — details

segment.bytes=1GB          # Rotation by size
segment.ms=7 days          # Rotation by time
segment.jitter.ms=0        # Random delay to avoid thundering herd
index.interval.bytes=4096  # Index write frequency

Sparse Index: Index not for every message, but every index.interval.bytes bytes. This keeps the index compact (~1 entry per 4KB of data). Search: binary search on index → linear scan within segment.
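The lookup path can be sketched like this; the index entries below are made-up sample values.

```python
import bisect

# Sketch of a sparse offset index: one entry per ~4 KB of log data,
# mapping offset -> byte position in the segment file. Lookup is a
# binary search on the index, then a linear scan forward from there.

index = [(0, 0), (120, 4096), (255, 8192), (401, 12288)]  # (offset, position)

def locate(target_offset):
    """Return the byte position to start scanning from."""
    offsets = [o for o, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i][1]  # scan linearly from here until target_offset

print(locate(300))  # 8192: nearest indexed offset <= 300 is 255
```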

Leader/Follower Replication Protocol

Replica Fetcher Thread (on each Follower):
  1. Send FetchRequest → Leader
  2. Leader returns batch with current leader epoch
  3. Follower writes to its log
  4. Updates LEO (Log End Offset)
  5. Leader updates HW (High Watermark)

High Watermark (HW) = the highest offset replicated by every ISR member
Messages at offsets >= HW are not yet visible to consumers

LEO (Log End Offset) — the offset of the next message to be written to a replica's log.
HW (High Watermark) — the offset up to which all ISR replicas have replicated. Consumers only see messages with offsets below HW.
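The HW rule reduces to taking the minimum LEO across ISR members, sketched here (replica names and offsets are illustrative):

```python
# Sketch: HW advances to the minimum LEO among in-sync replicas.
# Consumers may read offsets strictly below HW.

def high_watermark(leo_by_replica, isr):
    return min(leo_by_replica[r] for r in isr)

leos = {"leader": 101, "follower-b": 101, "follower-c": 96}
isr = {"leader", "follower-b"}  # follower-c lags and fell out of ISR

hw = high_watermark(leos, isr)
print(hw)  # 101 -- the lagging replica no longer holds HW back
```

This is also why a shrinking ISR lets HW advance: once a lagging follower is evicted, it stops gating consumer visibility.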

Leader Epoch: Introduced in Kafka 0.11 to eliminate data loss on leader change. Stored in leader-epoch-checkpoint file. On failover, the new leader truncates its log to the last confirmed epoch.

ISR (In-Sync Replicas) — deep dive

Conditions for being in ISR:
  replica.lag.time.max.ms=30000  // Follower must "catch up" within 30 seconds
  replica.fetch.wait.max.ms=500  // Max wait for fetch request

Replica State Machine:
  NewReplica → OnlineReplica → Leader/Follower → OfflineReplica → NonExistentReplica

OSR (Out-of-Sync Replicas): Replicas lagging behind the leader. Not counted towards min.insync.replicas.

KRaft Mode (Kafka Raft Metadata)

Before KRaft: ZooKeeper stored cluster metadata (topics, partitions, controller, ISR).

With KRaft (Kafka 3.3+ production-ready):

Controller Quorum (odd number, min 3 nodes):
  - Stores metadata in __cluster_metadata topic
  - Raft consensus protocol instead of ZK
  - Eliminates ZK as SPOF
  - Faster metadata change processing
  - Scales to millions of partitions (ZooKeeper-based clusters degrade beyond ~200,000)

metadata.log.segment.bytes=100MB
metadata.log.segment.ms=1 day
metadata.max.retention.bytes=100MB

Edge Cases (3+)

  1. Unclean Leader Election: With unclean.leader.election.enable=true, an out-of-sync replica can become leader → data loss (messages between the old leader's HW and new leader's LEO will be lost). For financial systems, always set to false.

  2. Split-Brain with KRaft: If the controller quorum loses majority (e.g., 2 out of 5 nodes are down), the cluster cannot change metadata (create topics, elect leaders). Reading/writing existing partitions continues to work.

  3. Log Divergence on Network Partition: If leader and follower are network-partitioned, the leader continues writing. On network recovery, the follower truncates to the leader's HW. If HW hasn't been updated yet — producer-confirmed messages may be lost (with acks=1).

  4. Metadata Propagation Delay: In large clusters (10K+ partitions), metadata updates (new topic, new partition) can take 30-60 seconds. Producers/consumers receive stale metadata and send messages to wrong brokers (NotLeaderForPartitionException).

  5. Disk Failure on Single Replica: If a disk with a partition fails, ISR shrinks. If min.insync.replicas is not met — producers with acks=all receive NotEnoughReplicasException. Kafka does not automatically restore data from other brokers — manual intervention is required.

Performance Numbers

| Metric | Value | Conditions |
| --- | --- | --- |
| Throughput per partition | ~50-100 MB/s | SSD, batch.size=256KB, lz4 |
| Max partitions/broker | ~200,000 | KRaft, 64GB RAM, SSD |
| Recommended partitions/broker | 2,000-4,000 | For stable operation |
| Replication lag (normal) | < 100 ms | Within a single AZ |
| Leader election time | 5-30 seconds | Depends on unclean.leader.election |
| Segment flush interval | log.flush.interval.messages=9223372036854775807 | Flushing delegated to the OS page cache |

Production War Story

Situation: E-commerce platform with 50 partitions on topic orders. After adding 50 more partitions (total 100) for scaling, clients started complaining about “lost” orders and duplicate processing.

Root cause: When increasing partitions, hash(key) % 50 changed to hash(key) % 100. Events for the same order landed in different partitions. The order processing service collected events by key, but now “order created” was in partition 23, while “payment” was in partition 73. The payment handler couldn't find the original order.

Additional problem: 100 partitions × 3 replicas = 300 partition replicas spread across 5 brokers, i.e. 60 per broker, each holding several open files per segment (.log, .index, .timeindex). Too many open files errors started occurring.

Solution:

  1. Created new topic orders-v2 with 100 partitions
  2. Dual-write: for 72 hours producers sent every message to both the old and the new topic, so consumers could finish draining old data from the first before switching to the second
  3. Consumers switched to orders-v2
  4. Increased ulimit -n from 4096 to 65536
  5. Set log.retention.check.interval.ms=300000 for timely segment cleanup

Lesson: Never add partitions to a topic if you use keys for ordering guarantees. Create a new topic and migrate traffic.

Monitoring (JMX + Burrow)

Partition JMX metrics:

kafka.cluster:type=Partition,name=UnderReplicated
kafka.cluster:type=Partition,name=InSyncReplicasCount
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka.log:type=Log,name=LogEndOffset,partition=0
kafka.log:type=Log,name=LogStartOffset,partition=0
kafka.log:type=Log,name=Size,partition=0

Burrow (consumer-centric):

  • Lag per partition — granularity down to partition level
  • Status: OK, WARN, ERR, STOP, STALL
  • HTTP API → Grafana → PagerDuty

Highload Best Practices

  1. Plan partitions ahead — they cannot be reduced without recreating the topic
  2. Formula: partitions = max(prod_throughput, cons_throughput) / per_partition_capacity * 1.5
  3. No more than 4000 partitions per broker — otherwise controller degradation
  4. Use KRaft for new clusters — faster metadata propagation, no ZK dependency
  5. Monitor ISR Shrink/Expand — indicator of replication problems
  6. Segment sizing: set segment.bytes so a segment closes in 1-4 hours
  7. OS tuning: vm.dirty_background_ratio=5, vm.dirty_ratio=10 for flush control
  8. Disk: SSD is mandatory for production. NVMe for high-throughput (>50 MB/s per partition)

🎯 Interview Cheat Sheet

Must know:

  • Partition — parallelism mechanism: 1 partition = 1 consumer
  • Ordering is strictly guaranteed only within a single partition (FIFO)
  • Offset — sequential message number, unique only within a partition
  • Each partition is replicated: Leader (read/write) + Followers (copy)
  • ISR — replicas synchronized with the leader; leader is chosen only from ISR
  • hash(key) % N — when adding partitions, key distribution changes
  • Reducing partitions is impossible; adding partitions breaks key ordering
  • Recommended maximum: 2000-4000 partitions per broker

Common follow-up questions:

  • What happens when you add partitions? — Old data is not redistributed; keys will land in different partitions → ordering broken.
  • What is High Watermark? — Last offset confirmed by all ISR. Consumers only see up to HW.
  • Can you reduce partitions? — No, only delete and recreate the topic.
  • What is Leader Epoch? — Leader generation number; prevents data loss on leader change.

Red flags (DO NOT say):

  • “Ordering is guaranteed across the whole topic” — only within a partition
  • “You can reduce partitions” — impossible
  • “Offset is globally unique” — unique only within a partition
  • “Adding partitions is safe when using keys” — breaks ordering

Related topics:

  • [[1. What is topic in Kafka]]
  • [[3. How is data distributed across partitions]]
  • [[17. What are leader and follower replicas]]
  • [[18. What is ISR (In-Sync Replicas)]]
  • [[28. How are old messages deleted from a topic]]