Question 22 · Section 15

How does message compression work?

Junior Level

Definition

Compression — encoding batches of messages in a smaller form to reduce network traffic and disk usage on the broker.

Without compression: 1MB batch
With lz4: 300KB batch (70% savings)

Available Algorithms

gzip   — best compression, slow speed
snappy — medium compression, fast speed
lz4    — good compression, very fast speed
zstd   — excellent compression, medium speed

Configuration

props.put("compression.type", "lz4");
// Options: none, gzip, snappy, lz4, zstd

Why is compression needed?

✅ Less network I/O
✅ Less disk usage on the broker
✅ Faster message transfer
✅ Lower cost for network traffic
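The one-liner above fits into a minimal producer configuration sketch (broker address and serializer classes are placeholders; batch settings are included because compression works on whole batches):

```java
import java.util.Properties;

public class CompressionConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Placeholder broker address
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Compress every batch with lz4 (options: none, gzip, snappy, lz4, zstd)
        props.put("compression.type", "lz4");
        // Larger, fuller batches compress better:
        // wait up to 10 ms and allow 64 KB batches
        props.put("linger.ms", "10");
        props.put("batch.size", "65536");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("compression.type"));
    }
}
```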

Middle Level

Algorithm Comparison

Algorithm | Compression | Speed     | CPU Usage | Recommendation
gzip      | Best        | Slow      | High      | Rarely used
snappy    | Medium      | Fast      | Low       | Low latency
lz4       | Good        | Very fast | Low       | Default choice
zstd      | Excellent   | Medium    | Medium    | Max compression

Compression Configuration

Producer:

props.put("compression.type", "lz4");
// Compression is applied to the entire batch

Broker:

# Global setting (can be overridden at the topic level)
compression.type=producer

With `producer`, the broker stores batches exactly as received and does NOT
recompress them. The value `uncompressed` forces the broker to decompress;
naming a specific codec forces recompression whenever the producer used a
different one.

Topic:

kafka-configs.sh --alter --topic orders \
  --add-config compression.type=lz4 \
  --bootstrap-server localhost:9092

Compression Ratio

Compression ratio examples:
  JSON data: 3:1 - 5:1
  Text: 4:1 - 10:1
  Binary data: 1.5:1 - 2:1
  Already compressed data: 1:1 (no effect)
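These ratios are easy to sanity-check with the JDK's built-in gzip (lz4/zstd numbers will differ, but the data-type effect is the same, including the "no effect on already compressed data" case):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class RatioDemo {
    // gzip-compress a byte array and return the result
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive JSON records: keys and most values repeat
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("{\"orderId\":").append(i)
              .append(",\"status\":\"CREATED\",\"currency\":\"USD\"}\n");
        }
        byte[] json = sb.toString().getBytes();
        byte[] once = gzip(json);   // JSON compresses well
        byte[] twice = gzip(once);  // gzipping gzip output gains nothing
        System.out.printf("JSON:    %.1f:1%n", (double) json.length / once.length);
        System.out.printf("Re-gzip: %.2f:1%n", (double) once.length / twice.length);
    }
}
```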

Common Mistakes

  1. gzip for high-throughput:
    High CPU overhead → bottleneck
    → Processing delays
    
  2. Without compression:
    Large messages → more network usage
    → Higher cost, slower transfer
    
  3. Compressing already compressed data:
    gzip on gzipped files → no effect
    → Wasted CPU
    

Senior Level

Internal Implementation

Compression Flow:

1. Producer accumulates messages in a batch
2. Batch is serialized
3. Compression algorithm is applied
4. Compressed batch is sent to the broker
5. Broker stores it compressed
6. Consumer receives it compressed, decompresses

Decompression:

Decompression happens on the consumer:
- CPU usage on the consumer
- Transparent to the application
- Kafka client library handles it automatically
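Because decompression is transparent, a consumer configuration contains no compression-related key at all (broker address and group id below are placeholders):

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "orders-consumers");        // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Note: no compression setting. The client library reads the codec
        // from each batch header and decompresses automatically.
        return props;
    }
}
```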

Algorithm Selection

lz4 — default choice:

Advantages:
- Very fast compression/decompression
- Good compression ratio
- Low CPU usage
- Ideal for most use cases

Use cases:
- General purpose
- High throughput systems
- Low latency requirements

zstd — max compression:

Advantages:
- Best compression ratio
- Configurable compression level
- Faster than gzip

Use cases:
- Saving disk space
- Cross-DC replication (bandwidth savings)
- Long-term storage

snappy — legacy choice:

Advantages:
- Fast compression
- Support in older clients

Use cases:
- Compatibility with legacy systems
- Low latency with moderate compression

Compression Level Tuning

// zstd supports a configurable compression level: 1-22, default 3
props.put("compression.type", "zstd");
// Since Apache Kafka 3.8 (KIP-390) the Java producer exposes the level directly:
props.put("compression.zstd.level", "9");  // higher = better ratio, more CPU
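The level-vs-CPU trade-off can be illustrated with the JDK's Deflater (zlib levels 1-9, not zstd's 1-22, but the shape of the trade-off is the same: higher level, smaller output, more CPU time):

```java
import java.util.zip.Deflater;

public class LevelDemo {
    // Compress data at the given zlib level and return the compressed size
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("{\"event\":\"click\",\"page\":\"/checkout\",\"n\":")
              .append(i).append("}");
        }
        byte[] data = sb.toString().getBytes();
        // Higher level -> smaller output at the cost of more CPU
        System.out.println("level 1: " + compressedSize(data, 1) + " bytes");
        System.out.println("level 9: " + compressedSize(data, 9) + " bytes");
    }
}
```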

Performance Impact

CPU overhead:
  none:  0%
  lz4:   ~5%
  snappy: ~10%
  zstd:  ~15-20%
  gzip:  ~30-40%

Network savings:
  none:  0%
  lz4:   ~50-70%
  snappy: ~40-60%
  zstd:  ~60-80%
  gzip:  ~70-85%

(approximate values for JSON/text data; actual figures depend on data type
and should be tested on production workload)

Monitoring

Key metrics:

kafka.producer:compression-rate-avg
kafka.producer:compression-time-avg
kafka.consumer:decompression-time-avg
kafka.server:bytes-in-per-sec
kafka.server:bytes-out-per-sec
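One way to read the producer metrics in-process is via JMX: the Kafka Java client registers MBeans under `kafka.producer:type=producer-metrics`. The attribute name below matches the metric listed above but should be verified against your client version; without a live producer in the JVM the query simply returns an empty set:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CompressionMetrics {
    // Prints compression-rate-avg for each producer client in this JVM
    public static void printCompressionRate() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pattern =
                new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
        Set<ObjectName> names = server.queryNames(pattern, null);
        for (ObjectName name : names) {
            Object rate = server.getAttribute(name, "compression-rate-avg");
            System.out.println(name.getKeyProperty("client-id") + ": " + rate);
        }
        if (names.isEmpty()) {
            System.out.println("no producer MBeans registered");
        }
    }

    public static void main(String[] args) throws Exception {
        printCompressionRate();
    }
}
```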

Alerts:

- Compression rate < 1.2 → investigate
- Compression time > threshold → warning
- CPU usage on brokers > threshold → warning

Best Practices

✅ lz4 by default for most cases
✅ zstd for saving disk/bandwidth
✅ Monitor CPU usage on brokers
✅ Compression on batch (not per message)
✅ Test ratio on production data

❌ Without compression for production
❌ gzip for high-throughput systems
❌ Compressing already compressed data
❌ Without monitoring compression rate
❌ Ignoring CPU impact

Architectural Decisions

  1. lz4 — balance — best choice for most cases
  2. zstd — savings — when disk/bandwidth is critical
  3. Compression on batch — better ratio
  4. Monitoring compression rate — efficiency indicator

Summary for Senior

  • lz4 — best balance for most cases
  • zstd — when maximum compression is needed
  • Compression is applied to the entire batch
  • CPU overhead vs network savings trade-off
  • Monitoring compression rate is critical for efficiency

🎯 Interview Cheat Sheet

Must know:

  • Compression compresses batches to reduce network I/O and disk usage
  • Algorithms: gzip (best compression, slow), snappy (fast), lz4 (balance), zstd (excellent compression)
  • lz4 — default choice: very fast, good ratio, low CPU (~5%)
  • Compression is applied to the entire batch — larger batch = better ratio
  • JSON/text: 3:1 - 5:1 ratio; binary: 1.5:1 - 2:1; already compressed: 1:1
  • Producer compresses, consumer decompresses transparently (client library handles)
  • Broker does NOT recompress — accepts the producer’s format

Common follow-up questions:

  • Which algorithm to choose? — lz4 for most cases, zstd for saving disk/bandwidth.
  • Where does decompression happen? — On the consumer, transparently to the application.
  • Can you compress already compressed data? — You can, but no effect (1:1), wasted CPU.
  • What is the overhead of gzip? — ~30-40% CPU, rarely used in high-throughput systems.

Red flags (DO NOT say):

  • “The broker recompresses messages” — it accepts the producer’s format
  • “gzip is the best choice for production” — high CPU overhead, lz4 is better
  • “Compression on each message” — on the entire batch
  • “Compression is free” — CPU overhead 5-40% depending on the algorithm

Related topics:

  • [[21. What is batch in Kafka producer]]
  • [[1. What is a topic in Kafka]]
  • [[3. How is data distributed across partitions]]