What is DLQ (Dead Letter Queue) in Kafka
Junior Level
Simple Definition
DLQ (Dead Letter Queue) — a special Kafka topic where messages that could not be processed after several attempts are sent. It is an “isolator” for problematic data.
Analogy
Imagine a gift-wrapping factory. A worker picks up a gift and tries to wrap it. If the gift is damaged (broken, cracked):
- Try again (retry)
- If after 3 attempts it is still broken — put it in the “defective” basket (DLQ)
- Continue wrapping the next gifts
The conveyor doesn’t stop, and the defective gifts can be examined and fixed later.
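The retry-then-DLQ loop from the analogy can be sketched in plain Java, with an in-memory list standing in for the DLQ topic (the class and field names here are illustrative, not a Kafka API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the conveyor analogy: retry a handler up to maxAttempts,
// then divert the message to an in-memory "DLQ" and move on.
public class RetryingProcessor {
    private final int maxAttempts;
    private final List<String> deadLetters = new ArrayList<>();

    public RetryingProcessor(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    public void process(String message, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);  // try to "wrap the gift"
                return;                   // success: move to the next message
            } catch (RuntimeException e) {
                // swallow and retry until attempts are exhausted
            }
        }
        deadLetters.add(message);         // "defective" basket: the DLQ
    }

    public List<String> deadLetters() {
        return deadLetters;
    }
}
```

The key property is the last line of the loop: the conveyor never blocks on a bad message.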
Example with Spring Kafka
@Configuration
public class KafkaConfig {
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
ConsumerFactory<String, String> consumerFactory,
KafkaTemplate<String, String> kafkaTemplate) {
var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
factory.setConsumerFactory(consumerFactory);
// DLQ configuration
factory.setCommonErrorHandler(new DefaultErrorHandler(
new DeadLetterPublishingRecoverer(kafkaTemplate),
new FixedBackOff(1000L, 3) // 3 retries, 1 second between attempts
));
return factory;
}
}
What Goes to DLQ
- Corrupt data (invalid JSON, schema mismatch)
- Cannot be processed (e.g., database unavailable for longer than all retry attempts)
- Business rule violations (negative price, invalid email)
- External service errors (API rate limited, timeout)
When to Use
- Critical data — messages cannot be lost
- Need for analysis — understanding why processing failed
- Compliance — audit and traceability
- Production — always have a DLQ plan
Middle Level
DLQ Architecture
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│  Producer   │─────>│  Main Topic  │─────>│  Consumer   │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                                     ┌────────────┴────────────┐
                                     │                         │
                                  Success               Error → Retry
                                                               │
                                                     Max retries reached?
                                                          Yes  │
                                                               ▼
                                                      ┌─────────────────┐
                                                      │    DLQ Topic    │
                                                      │   (orders-dlq)  │
                                                      └─────────────────┘
DLQ Message Enrichment
Simply sending the original message to the DLQ is not enough. Context must be added:
public class DLQEnricher {
public ProducerRecord<String, String> enrichForDLQ(
ConsumerRecord<String, String> original,
Exception error) {
ProducerRecord<String, String> dlqRecord = new ProducerRecord<>(
original.topic() + "-dlq", // orders → orders-dlq
original.key(),
original.value()
);
// Metadata for debugging
dlqRecord.headers().add("x-original-topic", original.topic().getBytes(StandardCharsets.UTF_8));
dlqRecord.headers().add("x-original-partition", String.valueOf(original.partition()).getBytes());
dlqRecord.headers().add("x-original-offset", String.valueOf(original.offset()).getBytes());
dlqRecord.headers().add("x-original-timestamp", String.valueOf(original.timestamp()).getBytes());
dlqRecord.headers().add("x-consumer-group", "order-service".getBytes());
dlqRecord.headers().add("x-error-message", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8)); // getMessage() may be null
dlqRecord.headers().add("x-error-stacktrace", getStackTrace(error).getBytes(StandardCharsets.UTF_8));
dlqRecord.headers().add("x-retry-count", "3".getBytes());
dlqRecord.headers().add("x-dlq-timestamp", String.valueOf(System.currentTimeMillis()).getBytes());
return dlqRecord;
}
}
DLQ Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Single DLQ | One DLQ topic for all errors | Simple systems |
| Per-topic DLQ | topic-dlq for each topic | Different team responsibilities |
| Per-error DLQ | topic-dlq-validation, topic-dlq-timeout | Detailed analysis |
| Multi-level DLQ | DLQ → DLQ2 → Manual review | High-compliance systems |
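A minimal resolver for the naming patterns above might look like this (the topic suffixes and the error-to-category mapping are assumptions for illustration, not part of any Kafka API):

```java
import java.util.concurrent.TimeoutException;

// Illustrative DLQ topic resolution for the per-topic and per-error patterns.
public class DlqTopicResolver {

    // Per-topic pattern: orders -> orders-dlq
    public static String perTopic(String topic) {
        return topic + "-dlq";
    }

    // Per-error pattern: route by error category for detailed analysis
    public static String perError(String topic, Exception e) {
        if (e instanceof IllegalArgumentException) {
            return topic + "-dlq-validation";
        }
        if (e instanceof TimeoutException) {
            return topic + "-dlq-timeout";
        }
        return topic + "-dlq-unknown";
    }
}
```

In Spring Kafka the same idea is expressed by passing a destination-resolver function to DeadLetterPublishingRecoverer.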
DLQ Consumer — Processing the DLQ
// DLQ Consumer — reads the DLQ and decides what to do
@KafkaListener(topics = "orders-dlq", groupId = "dlq-handler")
public void handleDLQ(ConsumerRecord<String, String> record) {
String originalTopic = getHeader(record, "x-original-topic");
String errorMessage = getHeader(record, "x-error-message");
String stacktrace = getHeader(record, "x-error-stacktrace");
// Error classification
if (errorMessage.contains("Timeout")) {
// Temporary error — retry
retryMessage(record);
} else if (errorMessage.contains("ConstraintViolation")) {
// Data issue — manual review
sendToManualReview(record);
} else if (errorMessage.contains("Schema")) {
// Schema mismatch — fix producer
alertProducerTeam(record);
} else {
// Unknown — manual review
sendToManualReview(record);
}
}
Common Errors Table
| Error | Symptoms | Consequences | Solution |
|---|---|---|---|
| Without DLQ | Poison pill blocks consumer | Consumer lag ∞, pipeline stall | Add DLQ |
| DLQ without monitoring | DLQ grows indefinitely | Disk fill, missed errors | Alert on DLQ size |
| DLQ without metadata | Impossible to debug | Hours spent on investigation | Enrich headers |
| DLQ = original topic | Infinite loop | Message bounce between topics | Separate DLQ topic |
| DLQ without retention | DLQ occupies disk forever | Disk fill | Configure retention |
| Without DLQ consumer | DLQ messages accumulate | No automatic recovery | DLQ consumer for auto-retry |
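The "DLQ = original topic" trap from the table can be guarded against with a trivial check before publishing (a sketch; the class name is hypothetical):

```java
// Refuse to publish a dead letter back into the topic it came from,
// which would otherwise produce an infinite consume-fail-republish loop.
public class DlqGuard {
    public static String resolveOrFail(String sourceTopic, String dlqTopic) {
        if (sourceTopic.equals(dlqTopic)) {
            throw new IllegalStateException(
                "DLQ must not be the original topic: " + sourceTopic);
        }
        return dlqTopic;
    }
}
```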
When NOT to Use DLQ
- Non-recoverable data — logs, metrics that can be lost
- Fully idempotent operations with automatic retry — repeating the retry indefinitely is safe, so alerting on retry exhaustion is usually sufficient
- Test environments — can simply be restarted
- Very low-volume (< 10 msg/min) — simpler to alert and fix manually
Senior Level
Deep Internals
DLQ in Spring Kafka — DeadLetterPublishingRecoverer
// Internal implementation
public class DeadLetterPublishingRecoverer implements ConsumerRecordRecoverer {
private final Function<ConsumerRecord<?, ?>, TopicPartition> destinationResolver;
private final KafkaTemplate<Object, Object> template;
@Override
public void accept(ConsumerRecord<?, ?> record, Exception exception) {
// 1. Determine DLQ topic
TopicPartition dest = destinationResolver.apply(record);
// 2. Create enriched record
ProducerRecord<Object, Object> dlqRecord = createProducerRecord(record, exception);
// 3. Send to DLQ (sync to guarantee delivery)
template.send(dlqRecord).get(); // Blocking!
// 4. Log
log.error("Sent to DLQ: topic={}, partition={}, offset={}",
record.topic(), record.partition(), record.offset());
}
// Header enrichment
private ProducerRecord<Object, Object> createProducerRecord(
ConsumerRecord<?, ?> record, Exception exception) {
// Copies key, value
// Adds headers:
// - KafkaHeaders.DLT_EXCEPTION_FQCN (exception class name)
// - KafkaHeaders.DLT_EXCEPTION_MESSAGE (exception message)
// - KafkaHeaders.DLT_ORIGINAL_TOPIC
// - KafkaHeaders.DLT_ORIGINAL_PARTITION
// - KafkaHeaders.DLT_ORIGINAL_OFFSET
// - KafkaHeaders.DLT_ORIGINAL_TIMESTAMP
}
}
Kafka Streams — Branching for DLQ
// Kafka Streams has no built-in DLQ
// Implement via branching
KStream<String, String> orders = builder.stream("orders");
// Split: valid vs invalid (branch() is deprecated since Kafka Streams 2.8 in favor of split())
KStream<String, String>[] branches = orders.branch(
(key, value) -> isValid(value), // valid stream
(key, value) -> !isValid(value) // invalid → DLQ
);
// Valid → process
branches[0].process(() -> orderProcessor);
// Invalid → DLQ topic
branches[1].mapValues(value -> enrichWithMetadata(value, "validation-error"))
.to("orders-dlq");
Kafka Connect DLQ
# Kafka Connect — built-in DLQ support
errors.tolerance=all
errors.deadletterqueue.topic.name=connect-dlq
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true
# Connect automatically sends failed records to DLQ
# with context headers
Trade-offs
| Aspect | Without DLQ | With DLQ |
|---|---|---|
| Data safety | Low (poison pill data loss) | High |
| System availability | Low (stall on error) | High |
| Complexity | Low | Medium |
| Debugging time | Hours-days | Minutes |
| Storage overhead | 0% | 0.1–1% (DLQ messages) |
| Operational overhead | High (manual investigation) | Low (automated) |
Edge Cases
1. DLQ overflow — when the DLQ itself becomes a problem:
Scenario:
Producer bug → all messages are invalid
100K messages/min → DLQ
DLQ topic: replication factor 3, retention 7 days
DLQ storage: 100K/min × 60 × 24 × 1KB ≈ 144GB/day (≈432GB/day with RF 3, ≈3TB over 7 days)
Problem: DLQ fills disk faster than the main topic!
Resolution:
1. DLQ retention.ms = 86400000 (1 day, not 7)
2. DLQ consumer for fast drain
3. Alert on DLQ rate > threshold
4. Circuit breaker: stop sending to DLQ, alert only
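Step 4 above, a circuit breaker on DLQ sends, can be sketched as a simple fixed-window counter (a simplification; production breakers usually use sliding windows and half-open states):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Rate-based circuit breaker for DLQ publishing: once too many failures
// land in one time window, stop sending to the DLQ and alert instead.
public class DlqCircuitBreaker {
    private final int maxPerWindow;
    private final long windowMillis;
    private final AtomicInteger count = new AtomicInteger();
    private volatile long windowStart;

    public DlqCircuitBreaker(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart = System.currentTimeMillis();
    }

    /** true = safe to publish to DLQ; false = breaker open, alert only. */
    public boolean allowDlqSend() {
        long now = System.currentTimeMillis();
        if (now - windowStart > windowMillis) { // new window: reset the counter
            windowStart = now;
            count.set(0);
        }
        return count.incrementAndGet() <= maxPerWindow;
    }
}
```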
2. DLQ ordering and causality:
Scenario:
Partition 0: order-1 (success), order-2 (DLQ), order-3 (success)
order-3 depends on order-2 (update after create)
order-2 in DLQ → order-3 processed but fails (no order-2)
Problem: Causality broken — downstream processing depends on DLQ'd message
Resolution:
1. Saga pattern: order-3 checks prerequisite
2. Compensating action: if order-2 is in DLQ → rollback order-1
3. Partition by correlation key — related messages in the same partition
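Resolution 3, partitioning by correlation key, relies on the same key always mapping to the same partition. A simplified illustration (Kafka's default partitioner actually uses a murmur2 hash of the serialized key, not Java's hashCode):

```java
// Related messages (order-2 create, order-2 update) share a correlation key,
// so they land in the same partition and keep their relative order.
public class CorrelationPartitioner {
    public static int partitionFor(String correlationKey, int numPartitions) {
        // Math.floorMod avoids negative partitions for negative hash codes
        return Math.floorMod(correlationKey.hashCode(), numPartitions);
    }
}
```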
3. DLQ replay — resending from DLQ:
// DLQ replay consumer
@KafkaListener(topics = "orders-dlq", groupId = "dlq-replayer")
public void replay(ConsumerRecord<String, String> record) {
String errorMessage = getHeader(record, "x-error-message");
String originalTopic = getHeader(record, "x-original-topic");
if (errorMessage.contains("Timeout") && isDatabaseHealthy()) {
// Retry — the error was temporary
originalProducer.send(new ProducerRecord<>(
originalTopic, record.key(), record.value()
));
log.info("Replayed from DLQ: {}", record.offset());
} else {
// Not retryable — manual review
log.warn("Non-replayable DLQ message: {}", record.offset());
}
}
// On repeated error — send to manual-review queue, NOT infinite retry.
// DLQ-for-DLQ pattern: after N retries → a second DLQ → manual intervention.
4. DLQ and Exactly-Once Semantics:
Problem:
Consumer reads message → error → send to DLQ → commit offset
Consumer crashes AFTER DLQ send, BEFORE commit
On restart: message read again → sent to DLQ again → duplicate in DLQ
Resolution:
1. Transactional DLQ producer:
transaction {
process(record);
if (error) dlqProducer.send(dlqRecord);
consumer.commitSync();
}
2. DLQ deduplication key = original topic + partition + offset
3. Idempotent DLQ producer
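Resolution 2, deduplication by topic + partition + offset, can be sketched with an in-memory set standing in for whatever store the DLQ producer consults (names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// The (topic, partition, offset) triple uniquely identifies the original
// record, so it makes a natural dedup key for DLQ writes after a crash-replay.
public class DlqDeduplicator {
    private final Set<String> seen = new HashSet<>();

    public static String dedupKey(String topic, int partition, long offset) {
        return topic + "-" + partition + "-" + offset;
    }

    /** true only the first time a given source record is dead-lettered. */
    public boolean firstTime(String topic, int partition, long offset) {
        return seen.add(dedupKey(topic, partition, offset));
    }
}
```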
5. Multi-tenant DLQ Isolation:
Scenario:
SaaS platform, 100 tenants, shared Kafka topics
Tenant A errors → DLQ
Tenant B errors → same DLQ
Tenant C sees DLQ messages from A and B (data leak!)
Resolution:
Per-tenant DLQ topics:
orders-tenant-a-dlq
orders-tenant-b-dlq
Or DLQ with tenant-id header + ACLs
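The per-tenant naming scheme above can be centralized in a small resolver so that no code path can publish a dead letter without a tenant id (a sketch; the naming follows the example topics in the text):

```java
// Per-tenant DLQ isolation: each tenant's failures go to its own topic,
// so tenant A can never read tenant B's dead letters.
public class TenantDlqResolver {
    public static String dlqTopicFor(String baseTopic, String tenantId) {
        if (tenantId == null || tenantId.isBlank()) {
            throw new IllegalArgumentException("tenant id required for DLQ routing");
        }
        return baseTopic + "-tenant-" + tenantId + "-dlq";
    }
}
```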
Performance (Production Numbers)
| DLQ Pattern | Overhead | Throughput Impact | Latency Impact | Storage |
|---|---|---|---|---|
| No DLQ | 0% | 0% | 0ms | 0 |
| Sync DLQ send | +10–30ms/msg | -5% | +20ms | 0.1–1% |
| Async DLQ send | +1ms | -1% | +2ms | 0.1–1% |
| DLQ + enrichment | +15–40ms/msg | -8% | +30ms | 0.5–2% |
| Transactional DLQ | +50ms | -15% | +50ms | 0.1–1% |
Production War Story
Situation: FinTech, transaction processing. A DLQ existed but was not monitored.
Problem: Discovered that the DLQ topic grew to 50M messages over 3 months. 50M transactions of $10–500 each were not processed. Potential financial exposure: $5M+.
Investigation:
- DLQ consumer existed but crashed 3 months ago
- Nobody noticed (no alerting)
- DLQ messages accumulated without processing
- Retention = -1 (infinite) → disk fill risk
- Root cause: producer schema change, validation messages in DLQ
- 50M messages = schema violation errors
Resolution:
- Immediate: Alert on DLQ size > 1000 messages
- Short-term: DLQ consumer restart + replay for retriable errors
- Medium: Schema validation on the producer side (prevention)
- Long: DLQ auto-classification + auto-retry for transient errors
- Operational: DLQ dashboard, on-call rotation for DLQ review
Post-mortem:
- DLQ without monitoring = no DLQ (false sense of security)
- DLQ retention policy is mandatory
- Schema registry for producer validation
- Regular DLQ review process (daily)
Monitoring (JMX, Prometheus, Burrow)
JMX/Micrometer metrics (illustrative names; the exact metric names depend on how your Spring Kafka metrics are registered):
spring.kafka.consumer.dlq.messages.sent.total
spring.kafka.consumer.dlq.messages.sent.rate
spring.kafka.consumer.retry.attempts.total
spring.kafka.consumer.retry.success.total
Prometheus + Grafana:
- record: kafka_dlq_message_rate
expr: rate(kafka_topic_messages_total{topic=~".*-dlq"}[5m])
- alert: KafkaDLQSizeExceeded
expr: kafka_topic_partition_log_size{topic=~".*-dlq"} > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "DLQ size exceeded 10,000 messages"
- alert: KafkaDLQMessageRateHigh
expr: rate(kafka_dlq_message_total[5m]) > 10
for: 2m
labels:
severity: critical
annotations:
summary: "DLQ ingest rate above 10 messages/second"
- alert: KafkaDLQConsumerDown
expr: kafka_consumer_lag{consumer_group="dlq-handler"} > 1000
for: 5m
labels:
severity: critical
annotations:
summary: "DLQ consumer is falling behind"
- alert: KafkaDLQDiskUsageHigh
expr: kafka_log_size_bytes{topic=~".*-dlq"} / kafka_disk_capacity_bytes > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "DLQ using > 10% disk capacity"
Burrow for DLQ consumer monitoring:
GET /v3/kafka/cluster/consumer/dlq-handler/status
{
"status": "OK",
"lag": {
"orders-dlq": {
"partition_0": {"current_lag": 0},
"partition_1": {"current_lag": 500} // DLQ consumer falling behind
}
}
}
High-Load Best Practices
✅ Separate DLQ topic (don't re-use original topic)
✅ DLQ enrichment with metadata (topic, partition, offset, exception)
✅ DLQ retention policy (1–7 days, not infinite)
✅ Alert on DLQ messages (never silent DLQ)
✅ DLQ consumer for automated retry/replay
✅ DLQ dashboard (message count, error types, age)
✅ Schema validation on the producer side (prevent poison pills)
✅ Per-error-type DLQ routing for automated classification
✅ Idempotent DLQ producer (prevent duplicate DLQ messages)
✅ Circuit breaker on DLQ send rate (prevent DLQ overflow)
❌ No DLQ in production
❌ DLQ without monitoring/alerting
❌ DLQ without retention policy (disk fill)
❌ DLQ = original topic (infinite loop)
❌ No DLQ consumer (messages accumulate forever)
❌ DLQ without metadata enrichment (lost context, impossible to debug)
Architectural Decisions
- DLQ — not just a topic, but a process — producer → DLQ → consumer → replay → fix
- Enrichment is critical — without metadata, DLQ is useless for debugging
- Auto-classification — categorize DLQ errors for automated handling
- Prevention > cure — schema validation, producer-side checks
- DLQ retention — balance between debugging needs and disk usage
Summary for Senior
- DLQ — insurance policy against poison pill and data loss
- DLQ without monitoring = no DLQ (false sense of security)
- Enrichment headers — critical for debugging and replay
- DLQ consumer is needed for automated retry/replay capability
- Per-error routing → automated classification → reduced manual review
- Schema validation on the producer — best way to reduce DLQ volume
- DLQ retention policy is mandatory for disk management
- Transactional DLQ for exactly-once DLQ semantics
- Under high load, the DLQ send rate itself can become a problem — a circuit breaker is needed
- DLQ dashboard + on-call process — operational necessity
🎯 Interview Cheat Sheet
Must know:
- DLQ — a special topic for messages not processed after max retries
- DLQ enrichment: original topic, partition, offset, exception message, stacktrace, timestamp
- Patterns: Single DLQ, Per-topic DLQ (topic-dlq), Per-error DLQ, Multi-level DLQ
- DLQ consumer: reads DLQ, classifies errors, retry or manual review
- DLQ retention policy (1-7 days) — without it, disk fill is guaranteed
- DLQ without monitoring = no DLQ (false sense of security)
- Transactional DLQ producer prevents duplicate DLQ messages
Common follow-up questions:
- Why enrich DLQ messages? — Without metadata, debugging and replay are impossible.
- What if the DLQ producer also fails? — Retry limit → manual-review queue.
- How to replay from DLQ? — DLQ consumer reads, checks the fix, resends to the original topic.
- DLQ = original topic? — NO! Infinite loop: message bounce between topics.
Red flags (DO NOT say):
- “DLQ without monitoring is fine” — messages accumulate forever unnoticed
- “DLQ without retention — store forever” — disk fill guaranteed
- “DLQ = original topic” — infinite retry loop
- “DLQ is not needed for idempotent operations” — poison pills still need it
Related topics:
- [[24. How to handle errors when reading messages]]
- [[26. How to monitor consumer lag]]
- [[11. How to configure exactly-once semantics]]
- [[10. What is the difference between at-most-once, at-least-once and exactly-once]]