Question 22 · Section 15

How does message compression work?

Junior Level

Definition

Compression — encoding batches of messages in a smaller form to reduce network traffic and disk usage on the broker.

Without compression: 1MB batch
With lz4: 300KB batch (70% savings)

Available Algorithms

gzip   — best compression, slow speed
snappy — medium compression, fast speed
lz4    — good compression, very fast speed
zstd   — excellent compression, medium speed

Configuration

props.put("compression.type", "lz4");
// Options: none, gzip, snappy, lz4, zstd

Why is compression needed?

✅ Less network I/O
✅ Less disk usage on the broker
✅ Faster message transfer
✅ Lower cost for network traffic
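The one-liner above fits into a minimal producer configuration sketch (broker address and serializer classes are placeholders; batch settings are included because compression works on whole batches):

```java
import java.util.Properties;

public class CompressionConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Placeholder broker address
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Compress every batch with lz4 (options: none, gzip, snappy, lz4, zstd)
        props.put("compression.type", "lz4");
        // Larger, fuller batches compress better:
        // wait up to 10 ms and allow 64 KB batches
        props.put("linger.ms", "10");
        props.put("batch.size", "65536");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("compression.type"));
    }
}
```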

Middle Level

Algorithm Comparison

Algorithm | Compression | Speed     | CPU Usage | Recommendation
gzip      | Best        | Slow      | High      | Rarely used
snappy    | Medium      | Fast      | Low       | Low latency
lz4       | Good        | Very fast | Low       | Default choice
zstd      | Excellent   | Medium    | Medium    | Max compression

Compression Configuration

Producer:

props.put("compression.type", "lz4");
// Compression is applied to the entire batch

Broker:

# Global setting (can be overridden at the topic level)
compression.type=producer

With `producer`, the broker stores batches exactly as received and does NOT
recompress them. The value `uncompressed` forces the broker to decompress;
naming a specific codec forces recompression whenever the producer used a
different one.

Topic:

kafka-configs.sh --alter --topic orders \
  --add-config compression.type=lz4 \
  --bootstrap-server localhost:9092

Compression Ratio

Compression ratio examples:
  JSON data: 3:1 - 5:1
  Text: 4:1 - 10:1
  Binary data: 1.5:1 - 2:1
  Already compressed data: 1:1 (no effect)
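These ratios are easy to sanity-check with the JDK's built-in gzip (lz4/zstd numbers will differ, but the data-type effect is the same, including the "no effect on already compressed data" case):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class RatioDemo {
    // gzip-compress a byte array and return the result
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive JSON records: keys and most values repeat
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("{\"orderId\":").append(i)
              .append(",\"status\":\"CREATED\",\"currency\":\"USD\"}\n");
        }
        byte[] json = sb.toString().getBytes();
        byte[] once = gzip(json);   // JSON compresses well
        byte[] twice = gzip(once);  // gzipping gzip output gains nothing
        System.out.printf("JSON:    %.1f:1%n", (double) json.length / once.length);
        System.out.printf("Re-gzip: %.2f:1%n", (double) once.length / twice.length);
    }
}
```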

Common Mistakes

  1. gzip for high-throughput:
    High CPU overhead → bottleneck
    → Processing delays
    
  2. Without compression:
    Large messages → more network usage
    → Higher cost, slower transfer
    
  3. Compressing already compressed data:
    gzip on gzipped files → no effect
    → Wasted CPU
    

Senior Level

Internal Implementation

Compression Flow:

1. Producer accumulates messages in a batch
2. Batch is serialized
3. Compression algorithm is applied
4. Compressed batch is sent to the broker
5. Broker stores it compressed
6. Consumer receives it compressed, decompresses

Decompression:

Decompression happens on the consumer:
- CPU usage on the consumer
- Transparent to the application
- Kafka client library handles it automatically
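Because decompression is transparent, a consumer configuration contains no compression-related key at all (broker address and group id below are placeholders):

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "orders-consumers");        // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Note: no compression setting. The client library reads the codec
        // from each batch header and decompresses automatically.
        return props;
    }
}
```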

Algorithm Selection

lz4 — default choice:

Advantages:
- Very fast compression/decompression
- Good compression ratio
- Low CPU usage
- Ideal for most use cases

Use cases:
- General purpose
- High throughput systems
- Low latency requirements

zstd — max compression:

Advantages:
- Best compression ratio
- Configurable compression level
- Faster than gzip

Use cases:
- Saving disk space
- Cross-DC replication (bandwidth savings)
- Long-term storage

snappy — legacy choice:

Advantages:
- Fast compression
- Support in older clients

Use cases:
- Compatibility with legacy systems
- Low latency with moderate compression

Compression Level Tuning

// zstd supports a configurable compression level: 1-22, default 3
props.put("compression.type", "zstd");
// Since Apache Kafka 3.8 (KIP-390) the Java producer exposes the level directly:
props.put("compression.zstd.level", "9");  // higher = better ratio, more CPU
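The level-vs-CPU trade-off can be illustrated with the JDK's Deflater (zlib levels 1-9, not zstd's 1-22, but the shape of the trade-off is the same: higher level, smaller output, more CPU time):

```java
import java.util.zip.Deflater;

public class LevelDemo {
    // Compress data at the given zlib level and return the compressed size
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("{\"event\":\"click\",\"page\":\"/checkout\",\"n\":")
              .append(i).append("}");
        }
        byte[] data = sb.toString().getBytes();
        // Higher level -> smaller output at the cost of more CPU
        System.out.println("level 1: " + compressedSize(data, 1) + " bytes");
        System.out.println("level 9: " + compressedSize(data, 9) + " bytes");
    }
}
```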

Performance Impact

CPU overhead:
  none:  0%
  lz4:   ~5%
  snappy: ~10%
  zstd:  ~15-20%
  gzip:  ~30-40%

Network savings:
  none:  0%
  lz4:   ~50-70%
  snappy: ~40-60%
  zstd:  ~60-80%
  gzip:  ~70-85%

(approximate values for JSON/text data; actual figures depend on data type
and should be tested on production workload)

Monitoring

Key metrics:

kafka.producer:compression-rate-avg
kafka.producer:compression-time-avg
kafka.consumer:decompression-time-avg
kafka.server:bytes-in-per-sec
kafka.server:bytes-out-per-sec
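One way to read the producer metrics in-process is via JMX: the Kafka Java client registers MBeans under `kafka.producer:type=producer-metrics`. The attribute name below matches the metric listed above but should be verified against your client version; without a live producer in the JVM the query simply returns an empty set:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CompressionMetrics {
    // Prints compression-rate-avg for each producer client in this JVM
    public static void printCompressionRate() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pattern =
                new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
        Set<ObjectName> names = server.queryNames(pattern, null);
        for (ObjectName name : names) {
            Object rate = server.getAttribute(name, "compression-rate-avg");
            System.out.println(name.getKeyProperty("client-id") + ": " + rate);
        }
        if (names.isEmpty()) {
            System.out.println("no producer MBeans registered");
        }
    }

    public static void main(String[] args) throws Exception {
        printCompressionRate();
    }
}
```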

Alerts:

- Compression rate < 1.2 → investigate
- Compression time > threshold → warning
- CPU usage on brokers > threshold → warning

Best Practices

✅ lz4 by default for most cases
✅ zstd for saving disk/bandwidth
✅ Monitor CPU usage on brokers
✅ Compression on batch (not per message)
✅ Test ratio on production data

❌ Without compression for production
❌ gzip for high-throughput systems
❌ Compressing already compressed data
❌ Without monitoring compression rate
❌ Ignoring CPU impact

Architectural Decisions

  1. lz4 — balance — best choice for most cases
  2. zstd — savings — when disk/bandwidth is critical
  3. Compression on batch — better ratio
  4. Monitoring compression rate — efficiency indicator

Summary for Senior

  • lz4 — best balance for most cases
  • zstd — when maximum compression is needed
  • Compression is applied to the entire batch
  • CPU overhead vs network savings trade-off
  • Monitoring compression rate is critical for efficiency

🎯 Interview Cheat Sheet

Must know:

  • Compression compresses batches to reduce network I/O and disk usage
  • Algorithms: gzip (best compression, slow), snappy (fast), lz4 (balance), zstd (excellent compression)
  • lz4 — default choice: very fast, good ratio, low CPU (~5%)
  • Compression is applied to the entire batch — larger batch = better ratio
  • JSON/text: 3:1 - 5:1 ratio; binary: 1.5:1 - 2:1; already compressed: 1:1
  • Producer compresses, consumer decompresses transparently (client library handles)
  • Broker does NOT recompress — accepts the producer’s format

Common follow-up questions:

  • Which algorithm to choose? — lz4 for most cases, zstd for saving disk/bandwidth.
  • Where does decompression happen? — On the consumer, transparently to the application.
  • Can you compress already compressed data? — You can, but no effect (1:1), wasted CPU.
  • What is the overhead of gzip? — ~30-40% CPU, rarely used in high-throughput systems.

Red flags (DO NOT say):

  • “The broker recompresses messages” — it accepts the producer’s format
  • “gzip is the best choice for production” — high CPU overhead, lz4 is better
  • “Compression on each message” — on the entire batch
  • “Compression is free” — CPU overhead 5-40% depending on the algorithm

Related topics:

  • [[21. What is batch in Kafka producer]]
  • [[1. What is a topic in Kafka]]
  • [[3. How is data distributed across partitions]]