How does message compression work
Junior Level
Definition
Compression reduces network traffic and disk usage by compressing whole batches of messages on the producer before they are sent.
Without compression: 1MB batch
With lz4: 300KB batch (70% savings)
Available Algorithms
gzip — best compression, slow speed
snappy — medium compression, fast speed
lz4 — good compression, very fast speed
zstd — excellent compression, medium speed
Configuration
props.put("compression.type", "lz4");
// Options: none (the producer default), gzip, snappy, lz4, zstd (zstd requires Kafka 2.1+)
Why is compression needed?
✅ Less network I/O
✅ Less disk usage on the broker
✅ Faster message transfer
✅ Lower cost for network traffic
Middle Level
Algorithm Comparison
| Algorithm | Compression | Speed | CPU Usage | Recommendation |
|---|---|---|---|---|
| gzip | Best | Slow | High | Rarely used |
| snappy | Medium | Fast | Low | Low latency |
| lz4 | Good | Very fast | Low | Default choice |
| zstd | Excellent | Medium | Medium | Max compression |
Compression Configuration
Producer:
props.put("compression.type", "lz4");
// Compression is applied to the entire batch
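Because the codec works on a whole batch, compression is usually tuned together with the batching settings. The sketch below builds such a configuration with plain `java.util.Properties`. The class name `CompressedProducerConfig` is hypothetical, and the `batch.size` and `linger.ms` values are illustrative assumptions rather than recommendations.

```java
import java.util.Properties;

class CompressedProducerConfig {
    // Builds producer settings. The result can be passed to
    // new KafkaProducer<>(props) when kafka-clients is on the classpath.
    static Properties build(String bootstrapServers, String codec) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", codec); // none, gzip, snappy, lz4, zstd
        // Assumed, illustrative values: a fuller batch gives the codec
        // more repeated data to work with.
        props.put("batch.size", "65536"); // max bytes per partition batch
        props.put("linger.ms", "10");     // wait up to 10 ms to fill a batch
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "lz4");
        System.out.println(props.getProperty("compression.type"));
    }
}
```

Raising `linger.ms` trades a little latency for fuller batches, so the right values depend on the workload.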
Broker:
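# Global setting (can be overridden at the topic level)
compression.type=producer
# "producer" (the default): the broker keeps the codec chosen by the producer and does NOT recompress.
# A specific codec (gzip, snappy, lz4, zstd): batches that arrive in a different codec are recompressed on the broker.
# "uncompressed": the broker decompresses batches before storing them.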
Topic:
kafka-configs.sh --alter --topic orders \
--add-config compression.type=lz4 \
--bootstrap-server localhost:9092
Compression Ratio
Compression ratio examples:
JSON data: 3:1 - 5:1
Text: 4:1 - 10:1
Binary data: 1.5:1 - 2:1
Already compressed data: 1:1 (no effect)
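Ratios like these are easy to check on your own payloads. The following is a minimal sketch that measures the ratio with the JDK's built-in DEFLATE (`java.util.zip.Deflater`), used here as a stand-in because lz4 and zstd are not part of the standard library. The class `RatioProbe` and the sample JSON records are hypothetical, and real figures will differ per codec and per data set.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

class RatioProbe {
    // Compresses the input with DEFLATE at the default level.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Ratio as "original : compressed", so 4.0 means 4:1.
    static double ratio(byte[] input) {
        return (double) input.length / deflate(input).length;
    }

    // Hypothetical payload: JSON-like records with repeated field names.
    static byte[] sampleJson(int records) {
        StringBuilder sb = new StringBuilder();
        Random rnd = new Random(42); // fixed seed keeps the sample reproducible
        for (int i = 0; i < records; i++) {
            sb.append("{\"orderId\":").append(i)
              .append(",\"status\":\"CREATED\",\"amount\":")
              .append(rnd.nextInt(10_000)).append("}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = sampleJson(1_000);
        byte[] alreadyCompressed = deflate(json);
        System.out.printf("JSON: %.2f:1%n", ratio(json));
        System.out.printf("Already compressed: %.2f:1%n", ratio(alreadyCompressed));
    }
}
```

Feeding the compressed output back in illustrates the last row: the second pass has almost nothing left to remove.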
Common Mistakes
- gzip for high-throughput:
  High CPU overhead → bottleneck → processing delays
- Without compression:
  Large messages → more network usage → higher cost, slower transfer
- Compressing already compressed data:
  gzip on gzipped files → no effect → wasted CPU
Senior Level
Internal Implementation
Compression Flow:
1. Producer accumulates messages in a batch
2. Batch is serialized
3. Compression algorithm is applied
4. Compressed batch is sent to the broker
5. Broker stores it compressed
6. Consumer receives it compressed, decompresses
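The six steps above can be sketched as a round trip in plain Java. This is a simplified model, not Kafka's actual record batch format: the length-prefixed layout, the class `BatchFlowSketch`, and the use of the JDK's DEFLATE instead of lz4 or zstd are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class BatchFlowSketch {
    // Steps 1-2: serialize the accumulated messages as [int length][bytes]...
    static byte[] serialize(List<String> messages) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String m : messages) {
            byte[] body = m.getBytes(StandardCharsets.UTF_8);
            out.write(ByteBuffer.allocate(4).putInt(body.length).array(), 0, 4);
            out.write(body, 0, body.length);
        }
        return out.toByteArray();
    }

    // Step 3: compress the whole serialized batch in one pass.
    static byte[] compress(byte[] batch) {
        Deflater deflater = new Deflater();
        deflater.setInput(batch);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Step 6: the consumer side decompresses the batch it received.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid batch");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt batch", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Splits the decompressed batch back into individual messages.
    static List<String> deserialize(byte[] batch) {
        List<String> messages = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(batch);
        while (buf.hasRemaining()) {
            byte[] body = new byte[buf.getInt()];
            buf.get(body);
            messages.add(new String(body, StandardCharsets.UTF_8));
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> sent = Arrays.asList("{\"id\":1}", "{\"id\":2}", "{\"id\":3}");
        byte[] onTheWire = compress(serialize(sent)); // steps 1-4
        byte[] stored = onTheWire;                    // step 5: kept as-is
        List<String> received = deserialize(decompress(stored)); // step 6
        System.out.println(received.equals(sent));
    }
}
```

The "broker" in this sketch never touches the payload, which mirrors why storing batches compressed costs it no extra compression work.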
Decompression:
Decompression happens on the consumer:
- CPU usage on the consumer
- Transparent to the application
- Kafka client library handles it automatically
Algorithm Selection
lz4 — default choice:
Advantages:
- Very fast compression/decompression
- Good compression ratio
- Low CPU usage
- Ideal for most use cases
Use cases:
- General purpose
- High throughput systems
- Low latency requirements
zstd — max compression:
Advantages:
- Best compression ratio
- Configurable compression level
- Faster than gzip
Use cases:
- Saving disk space
- Cross-DC replication (bandwidth savings)
- Long-term storage
snappy — legacy choice:
Advantages:
- Fast compression
- Support in older clients
Use cases:
- Compatibility with legacy systems
- Low latency with moderate compression
Compression Level Tuning
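// zstd supports level configuration
// 1-22 (default 3)
props.put("compression.type", "zstd");
// Java client, Kafka 3.8+ (KIP-390); earlier versions do not recognize this key,
// and other client libraries use their own setting names:
props.put("compression.zstd.level", "9"); // higher = better compression, more CPU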
Performance Impact
CPU overhead:
none: 0%
lz4: ~5%
snappy: ~10%
zstd: ~15-20%
gzip: ~30-40%
Network savings:
none: 0%
lz4: ~50-70%
snappy: ~40-60%
zstd: ~60-80%
gzip: ~70-85%
(approximate values for JSON/text data; actual figures depend on data type
and should be tested on production workload)
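One way to test the trade-off on your own data is to time a codec at two effort levels and compare the bytes saved. The sketch below does this with the JDK's DEFLATE at `BEST_SPEED` and `BEST_COMPRESSION`, as a stand-in for comparing Kafka codecs. The class `LevelTradeoff` and its sample payload are hypothetical, and single-run timings like these are only a rough indication, not a benchmark.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class LevelTradeoff {
    // Compresses the input with DEFLATE at the given level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Share of bytes saved, in percent: 1000 bytes down to 300 is 70.
    static double savingsPercent(int originalBytes, int compressedBytes) {
        return 100.0 * (1.0 - (double) compressedBytes / originalBytes);
    }

    // Hypothetical payload: log-like lines with some varying fields.
    static byte[] samplePayload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5_000; i++) {
            sb.append("ts=").append(1_700_000_000L + i)
              .append(" level=INFO service=orders msg=\"order ")
              .append(i * 31 % 977).append(" accepted\"\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload();
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] compressed = deflate(payload, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes (%.1f%% saved) in %d us%n",
                    level, payload.length, compressed.length,
                    savingsPercent(payload.length, compressed.length), micros);
        }
    }
}
```

For a real decision, run the producer itself against a representative workload and compare CPU usage and throughput per codec.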
Monitoring
Key metrics:
kafka.producer:compression-rate-avg (compressed size / uncompressed size, so lower is better)
kafka.producer:compression-time-avg
kafka.consumer:decompression-time-avg
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Alerts:
- Compression ratio < 1.2:1 (compression-rate-avg above roughly 0.83) → investigate
- Compression time > threshold → warning
- CPU usage on brokers > threshold → warning
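The Java producer reports `compression-rate-avg` as compressed size divided by uncompressed size, so it has to be inverted before it can be compared with an "N:1" ratio threshold. A minimal sketch of that check follows. The class `CompressionAlert` and the 1.2 threshold passed in `main` are illustrative assumptions.

```java
class CompressionAlert {
    // Converts compression-rate-avg (compressed / uncompressed, e.g. 0.30)
    // into an "N:1" compression ratio (e.g. 3.33).
    static double toRatio(double compressionRateAvg) {
        return 1.0 / compressionRateAvg;
    }

    // True when the achieved ratio falls below the minimum worth paying CPU for.
    // A NaN metric (no data yet) never triggers the alert.
    static boolean shouldInvestigate(double compressionRateAvg, double minRatio) {
        return toRatio(compressionRateAvg) < minRatio;
    }

    public static void main(String[] args) {
        System.out.println(shouldInvestigate(0.90, 1.2)); // true: ratio is about 1.11:1
        System.out.println(shouldInvestigate(0.30, 1.2)); // false: ratio is about 3.33:1
    }
}
```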
Best Practices
✅ lz4 by default for most cases
✅ zstd for saving disk/bandwidth
✅ Monitor CPU usage on brokers
✅ Compression on batch (not per message)
✅ Test ratio on production data
❌ Running production without compression
❌ gzip for high-throughput systems
❌ Compressing already compressed data
❌ Not monitoring the compression rate
❌ Ignoring CPU impact
Architectural Decisions
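- lz4: balanced speed and ratio, the best choice for most cases
- zstd: maximum savings, for when disk or bandwidth is critical
- Compress whole batches: more repeated data per compression pass gives a better ratio
- Monitor the compression rate: it shows whether compression is paying off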
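The batch-level decision can be illustrated by compressing the same messages one by one and as a single batch. This is a minimal sketch using the JDK's DEFLATE rather than a Kafka codec. The class `BatchVsPerMessage` and its sample messages are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

class BatchVsPerMessage {
    // Compressed size of one input, using DEFLATE at the default level.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Hypothetical small JSON messages that share field names.
    static List<byte[]> sampleMessages(int count) {
        List<byte[]> messages = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String json = "{\"orderId\":" + i
                    + ",\"status\":\"CREATED\",\"currency\":\"USD\"}";
            messages.add(json.getBytes(StandardCharsets.UTF_8));
        }
        return messages;
    }

    // Each message compressed on its own: repetition across messages is lost.
    static int perMessageTotal(List<byte[]> messages) {
        int total = 0;
        for (byte[] m : messages) {
            total += deflatedSize(m);
        }
        return total;
    }

    // All messages concatenated and compressed once, as a batch.
    static int batchTotal(List<byte[]> messages) {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] m : messages) {
            batch.write(m, 0, m.length);
        }
        return deflatedSize(batch.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> messages = sampleMessages(200);
        System.out.println("per message: " + perMessageTotal(messages) + " bytes");
        System.out.println("as a batch:  " + batchTotal(messages) + " bytes");
    }
}
```

Small messages carry too little repetition individually, so most of the gain appears only when the codec sees many of them together.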
Summary for Senior
- lz4 — best balance for most cases
- zstd — when maximum compression is needed
- Compression is applied to the entire batch
- CPU overhead vs network savings trade-off
- Monitoring compression rate is critical for efficiency
🎯 Interview Cheat Sheet
Must know:
- Kafka compresses whole batches to reduce network I/O and disk usage
- Algorithms: gzip (best compression, slow), snappy (fast), lz4 (balance), zstd (excellent compression)
- lz4 — default choice: very fast, good ratio, low CPU (~5%)
- Compression is applied to the entire batch — larger batch = better ratio
- JSON/text: 3:1 - 5:1 ratio; binary: 1.5:1 - 2:1; already compressed: 1:1
- Producer compresses, consumer decompresses transparently (client library handles)
- Broker does NOT recompress by default (compression.type=producer); it accepts the producer’s format
Common follow-up questions:
- Which algorithm to choose? — lz4 for most cases, zstd for saving disk/bandwidth.
- Where does decompression happen? — On the consumer, transparently to the application.
- Can you compress already compressed data? You can, but it has no effect (about 1:1) and wastes CPU.
- What is the overhead of gzip? — ~30-40% CPU, rarely used in high-throughput systems.
Red flags (DO NOT say):
- “The broker always recompresses messages”: by default it accepts the producer’s format
- “gzip is the best choice for production” — high CPU overhead, lz4 is better
- “Compression on each message” — on the entire batch
- “Compression is free” — CPU overhead 5-40% depending on the algorithm
Related topics:
- [[21. What is batch in Kafka producer]]
- [[1. What is a topic in Kafka]]
- [[3. How is data distributed across partitions]]