What is DLQ (Dead Letter Queue) in Kafka
Junior Level
Simple Definition
DLQ (Dead Letter Queue) — a special Kafka topic where messages that could not be processed after several attempts are sent. It is an “isolator” for problematic data.
Analogy
Imagine a gift-wrapping factory. A worker picks up a gift and tries to wrap it. If the gift is damaged (broken, cracked):
- Try again (retry)
- If after 3 attempts it is still broken — put it in the “defective” basket (DLQ)
- Continue wrapping the next gifts
The conveyor doesn’t stop, and the defective gifts can be examined and fixed later.
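The retry-then-DLQ loop from the analogy can be sketched in plain Java, with an in-memory list standing in for the DLQ topic (the class and field names here are illustrative, not a Kafka API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the conveyor analogy: retry a handler up to maxAttempts,
// then divert the message to an in-memory "DLQ" and move on.
public class RetryingProcessor {
    private final int maxAttempts;
    private final List<String> deadLetters = new ArrayList<>();

    public RetryingProcessor(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    public void process(String message, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);  // try to "wrap the gift"
                return;                   // success: move to the next message
            } catch (RuntimeException e) {
                // swallow and retry until attempts are exhausted
            }
        }
        deadLetters.add(message);         // "defective" basket: the DLQ
    }

    public List<String> deadLetters() {
        return deadLetters;
    }
}
```

The key property is the last line of the loop: the conveyor never blocks on a bad message.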
Example with Spring Kafka
@Configuration
public class KafkaConfig {
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
ConsumerFactory<String, String> consumerFactory,
KafkaTemplate<String, String> kafkaTemplate) {
var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
factory.setConsumerFactory(consumerFactory);
// DLQ configuration
factory.setCommonErrorHandler(new DefaultErrorHandler(
new DeadLetterPublishingRecoverer(kafkaTemplate),
new FixedBackOff(1000L, 3) // 3 retries, 1 second between attempts
));
return factory;
}
}
What Goes to DLQ
- Corrupt data (invalid JSON, schema mismatch)
- Cannot be processed (e.g., database unavailable for longer than all retry attempts)
- Business rule violations (negative price, invalid email)
- External service errors (API rate limited, timeout)
When to Use
- Critical data — messages cannot be lost
- Need for analysis — understanding why processing failed
- Compliance — audit and traceability
- Production — always have a DLQ plan
Middle Level
DLQ Architecture
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│  Producer   │─────>│  Main Topic  │─────>│  Consumer   │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                                     ┌────────────┴────────────┐
                                     │                         │
                                  Success               Error → Retry
                                                               │
                                                     Max retries reached?
                                                          Yes  │
                                                               ▼
                                                      ┌─────────────────┐
                                                      │    DLQ Topic    │
                                                      │   (orders-dlq)  │
                                                      └─────────────────┘
DLQ Message Enrichment
Simply sending the original message to the DLQ is not enough. Context must be added:
public class DLQEnricher {
public ProducerRecord<String, String> enrichForDLQ(
ConsumerRecord<String, String> original,
Exception error) {
ProducerRecord<String, String> dlqRecord = new ProducerRecord<>(
original.topic() + "-dlq", // orders → orders-dlq
original.key(),
original.value()
);
// Metadata for debugging
dlqRecord.headers().add("x-original-topic", original.topic().getBytes(StandardCharsets.UTF_8));
dlqRecord.headers().add("x-original-partition", String.valueOf(original.partition()).getBytes());
dlqRecord.headers().add("x-original-offset", String.valueOf(original.offset()).getBytes());
dlqRecord.headers().add("x-original-timestamp", String.valueOf(original.timestamp()).getBytes());
dlqRecord.headers().add("x-consumer-group", "order-service".getBytes());
dlqRecord.headers().add("x-error-message", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8)); // getMessage() may be null
dlqRecord.headers().add("x-error-stacktrace", getStackTrace(error).getBytes(StandardCharsets.UTF_8));
dlqRecord.headers().add("x-retry-count", "3".getBytes());
dlqRecord.headers().add("x-dlq-timestamp", String.valueOf(System.currentTimeMillis()).getBytes());
return dlqRecord;
}
}
DLQ Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Single DLQ | One DLQ topic for all errors | Simple systems |
| Per-topic DLQ | topic-dlq for each topic | Different team responsibilities |
| Per-error DLQ | topic-dlq-validation, topic-dlq-timeout | Detailed analysis |
| Multi-level DLQ | DLQ → DLQ2 → Manual review | High-compliance systems |
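A minimal resolver for the naming patterns above might look like this (the topic suffixes and the error-to-category mapping are assumptions for illustration, not part of any Kafka API):

```java
import java.util.concurrent.TimeoutException;

// Illustrative DLQ topic resolution for the per-topic and per-error patterns.
public class DlqTopicResolver {

    // Per-topic pattern: orders -> orders-dlq
    public static String perTopic(String topic) {
        return topic + "-dlq";
    }

    // Per-error pattern: route by error category for detailed analysis
    public static String perError(String topic, Exception e) {
        if (e instanceof IllegalArgumentException) {
            return topic + "-dlq-validation";
        }
        if (e instanceof TimeoutException) {
            return topic + "-dlq-timeout";
        }
        return topic + "-dlq-unknown";
    }
}
```

In Spring Kafka the same idea is expressed by passing a destination-resolver function to DeadLetterPublishingRecoverer.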
DLQ Consumer — Processing the DLQ
// DLQ Consumer — reads the DLQ and decides what to do
@KafkaListener(topics = "orders-dlq", groupId = "dlq-handler")
public void handleDLQ(ConsumerRecord<String, String> record) {
String originalTopic = getHeader(record, "x-original-topic");
String errorMessage = getHeader(record, "x-error-message");
String stacktrace = getHeader(record, "x-error-stacktrace");
// Error classification
if (errorMessage.contains("Timeout")) {
// Temporary error — retry
retryMessage(record);
} else if (errorMessage.contains("ConstraintViolation")) {
// Data issue — manual review
sendToManualReview(record);
} else if (errorMessage.contains("Schema")) {
// Schema mismatch — fix producer
alertProducerTeam(record);
} else {
// Unknown — manual review
sendToManualReview(record);
}
}
Common Errors Table
| Error | Symptoms | Consequences | Solution |
|---|---|---|---|
| Without DLQ | Poison pill blocks consumer | Consumer lag ∞, pipeline stall | Add DLQ |
| DLQ without monitoring | DLQ grows indefinitely | Disk fill, missed errors | Alert on DLQ size |
| DLQ without metadata | Impossible to debug | Hours spent on investigation | Enrich headers |
| DLQ = original topic | Infinite loop | Message bounce between topics | Separate DLQ topic |
| DLQ without retention | DLQ occupies disk forever | Disk fill | Configure retention |
| Without DLQ consumer | DLQ messages accumulate | No automatic recovery | DLQ consumer for auto-retry |
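The "DLQ = original topic" trap from the table can be guarded against with a trivial check before publishing (a sketch; the class name is hypothetical):

```java
// Refuse to publish a dead letter back into the topic it came from,
// which would otherwise produce an infinite consume-fail-republish loop.
public class DlqGuard {
    public static String resolveOrFail(String sourceTopic, String dlqTopic) {
        if (sourceTopic.equals(dlqTopic)) {
            throw new IllegalStateException(
                "DLQ must not be the original topic: " + sourceTopic);
        }
        return dlqTopic;
    }
}
```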
When NOT to Use DLQ
- Non-recoverable data — logs, metrics that can be lost
- Fully idempotent operations with automatic retry — repeating the retry indefinitely is safe, so alerting on retry exhaustion is usually sufficient
- Test environments — can simply be restarted
- Very low-volume (< 10 msg/min) — simpler to alert and fix manually
Senior Level
Deep Internals
DLQ in Spring Kafka — DeadLetterPublishingRecoverer
// Internal implementation
public class DeadLetterPublishingRecoverer implements ConsumerRecordRecoverer {
private final Function<ConsumerRecord<?, ?>, TopicPartition> destinationResolver;
private final KafkaTemplate<Object, Object> template;
@Override
public void accept(ConsumerRecord<?, ?> record, Exception exception) {
// 1. Determine DLQ topic
TopicPartition dest = destinationResolver.apply(record);
// 2. Create enriched record
ProducerRecord<Object, Object> dlqRecord = createProducerRecord(record, exception);
// 3. Send to DLQ (sync to guarantee delivery)
template.send(dlqRecord).get(); // Blocking!
// 4. Log
log.error("Sent to DLQ: topic={}, partition={}, offset={}",
record.topic(), record.partition(), record.offset());
}
// Header enrichment
private ProducerRecord<Object, Object> createProducerRecord(
ConsumerRecord<?, ?> record, Exception exception) {
// Copies key, value
// Adds headers:
// - KafkaHeaders.DLT_EXCEPTION_FQCN (exception class name)
// - KafkaHeaders.DLT_EXCEPTION_MESSAGE (exception message)
// - KafkaHeaders.DLT_ORIGINAL_TOPIC
// - KafkaHeaders.DLT_ORIGINAL_PARTITION
// - KafkaHeaders.DLT_ORIGINAL_OFFSET
// - KafkaHeaders.DLT_ORIGINAL_TIMESTAMP
}
}
Kafka Streams — Branching for DLQ
// Kafka Streams has no built-in DLQ
// Implement via branching
KStream<String, String> orders = builder.stream("orders");
// Split: valid vs invalid (branch() is deprecated since Kafka Streams 2.8 in favor of split())
KStream<String, String>[] branches = orders.branch(
(key, value) -> isValid(value), // valid stream
(key, value) -> !isValid(value) // invalid → DLQ
);
// Valid → process
branches[0].process(() -> orderProcessor);
// Invalid → DLQ topic
branches[1].mapValues(value -> enrichWithMetadata(value, "validation-error"))
.to("orders-dlq");
Kafka Connect DLQ
# Kafka Connect — built-in DLQ support
errors.tolerance=all
errors.deadletterqueue.topic.name=connect-dlq
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true
# Connect automatically sends failed records to DLQ
# with context headers
Trade-offs
| Aspect | Without DLQ | With DLQ |
|---|---|---|
| Data safety | Low (poison pill data loss) | High |
| System availability | Low (stall on error) | High |
| Complexity | Low | Medium |
| Debugging time | Hours-days | Minutes |
| Storage overhead | 0% | 0.1–1% (DLQ messages) |
| Operational overhead | High (manual investigation) | Low (automated) |
Edge Cases
1. DLQ overflow — when the DLQ itself becomes a problem:
Scenario:
Producer bug → all messages are invalid
100K messages/min → DLQ
DLQ topic: replication factor 3, retention 7 days
DLQ storage: 100K/min × 60 × 24 × 1KB ≈ 144GB/day (≈432GB/day with RF 3, ≈3TB over 7 days)
Problem: DLQ fills disk faster than the main topic!
Resolution:
1. DLQ retention.ms = 86400000 (1 day, not 7)
2. DLQ consumer for fast drain
3. Alert on DLQ rate > threshold
4. Circuit breaker: stop sending to DLQ, alert only
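Step 4 above, a circuit breaker on DLQ sends, can be sketched as a simple fixed-window counter (a simplification; production breakers usually use sliding windows and half-open states):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Rate-based circuit breaker for DLQ publishing: once too many failures
// land in one time window, stop sending to the DLQ and alert instead.
public class DlqCircuitBreaker {
    private final int maxPerWindow;
    private final long windowMillis;
    private final AtomicInteger count = new AtomicInteger();
    private volatile long windowStart;

    public DlqCircuitBreaker(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart = System.currentTimeMillis();
    }

    /** true = safe to publish to DLQ; false = breaker open, alert only. */
    public boolean allowDlqSend() {
        long now = System.currentTimeMillis();
        if (now - windowStart > windowMillis) { // new window: reset the counter
            windowStart = now;
            count.set(0);
        }
        return count.incrementAndGet() <= maxPerWindow;
    }
}
```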
2. DLQ ordering and causality:
Scenario:
Partition 0: order-1 (success), order-2 (DLQ), order-3 (success)
order-3 depends on order-2 (update after create)
order-2 in DLQ → order-3 processed but fails (no order-2)
Problem: Causality broken — downstream processing depends on DLQ'd message
Resolution:
1. Saga pattern: order-3 checks prerequisite
2. Compensating action: if order-2 is in DLQ → rollback order-1
3. Partition by correlation key — related messages in the same partition
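Resolution 3, partitioning by correlation key, relies on the same key always mapping to the same partition. A simplified illustration (Kafka's default partitioner actually uses a murmur2 hash of the serialized key, not Java's hashCode):

```java
// Related messages (order-2 create, order-2 update) share a correlation key,
// so they land in the same partition and keep their relative order.
public class CorrelationPartitioner {
    public static int partitionFor(String correlationKey, int numPartitions) {
        // Math.floorMod avoids negative partitions for negative hash codes
        return Math.floorMod(correlationKey.hashCode(), numPartitions);
    }
}
```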
3. DLQ replay — resending from DLQ:
// DLQ replay consumer
@KafkaListener(topics = "orders-dlq", groupId = "dlq-replayer")
public void replay(ConsumerRecord<String, String> record) {
String errorMessage = getHeader(record, "x-error-message");
String originalTopic = getHeader(record, "x-original-topic");
if (errorMessage.contains("Timeout") && isDatabaseHealthy()) {
// Retry — the error was temporary
originalProducer.send(new ProducerRecord<>(
originalTopic, record.key(), record.value()
));
log.info("Replayed from DLQ: {}", record.offset());
} else {
// Not retryable — manual review
log.warn("Non-replayable DLQ message: {}", record.offset());
}
}
// On repeated error — send to manual-review queue, NOT infinite retry.
// DLQ-for-DLQ pattern: after N retries → a second DLQ → manual intervention.
4. DLQ and Exactly-Once Semantics:
Problem:
Consumer reads message → error → send to DLQ → commit offset
Consumer crashes AFTER DLQ send, BEFORE commit
On restart: message read again → sent to DLQ again → duplicate in DLQ
Resolution:
1. Transactional DLQ producer:
transaction {
process(record);
if (error) dlqProducer.send(dlqRecord);
consumer.commitSync();
}
2. DLQ deduplication key = original topic + partition + offset
3. Idempotent DLQ producer
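Resolution 2, deduplication by topic + partition + offset, can be sketched with an in-memory set standing in for whatever store the DLQ producer consults (names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// The (topic, partition, offset) triple uniquely identifies the original
// record, so it makes a natural dedup key for DLQ writes after a crash-replay.
public class DlqDeduplicator {
    private final Set<String> seen = new HashSet<>();

    public static String dedupKey(String topic, int partition, long offset) {
        return topic + "-" + partition + "-" + offset;
    }

    /** true only the first time a given source record is dead-lettered. */
    public boolean firstTime(String topic, int partition, long offset) {
        return seen.add(dedupKey(topic, partition, offset));
    }
}
```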
5. Multi-tenant DLQ Isolation:
Scenario:
SaaS platform, 100 tenants, shared Kafka topics
Tenant A errors → DLQ
Tenant B errors → same DLQ
Tenant C sees DLQ messages from A and B (data leak!)
Resolution:
Per-tenant DLQ topics:
orders-tenant-a-dlq
orders-tenant-b-dlq
Or DLQ with tenant-id header + ACLs
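The per-tenant naming scheme above can be centralized in a small resolver so that no code path can publish a dead letter without a tenant id (a sketch; the naming follows the example topics in the text):

```java
// Per-tenant DLQ isolation: each tenant's failures go to its own topic,
// so tenant A can never read tenant B's dead letters.
public class TenantDlqResolver {
    public static String dlqTopicFor(String baseTopic, String tenantId) {
        if (tenantId == null || tenantId.isBlank()) {
            throw new IllegalArgumentException("tenant id required for DLQ routing");
        }
        return baseTopic + "-tenant-" + tenantId + "-dlq";
    }
}
```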
Performance (Production Numbers)
| DLQ Pattern | Overhead | Throughput Impact | Latency Impact | Storage |
|---|---|---|---|---|
| No DLQ | 0% | 0% | 0ms | 0 |
| Sync DLQ send | +10–30ms/msg | -5% | +20ms | 0.1–1% |
| Async DLQ send | +1ms | -1% | +2ms | 0.1–1% |
| DLQ + enrichment | +15–40ms/msg | -8% | +30ms | 0.5–2% |
| Transactional DLQ | +50ms | -15% | +50ms | 0.1–1% |
Production War Story
Situation: FinTech, transaction processing. A DLQ existed but was not monitored.
Problem: Discovered that the DLQ topic grew to 50M messages over 3 months. 50M transactions of $10–500 each were not processed. Potential financial exposure: $5M+.
Investigation:
- DLQ consumer existed but crashed 3 months ago
- Nobody noticed (no alerting)
- DLQ messages accumulated without processing
- Retention = -1 (infinite) → disk fill risk
- Root cause: producer schema change, validation messages in DLQ
- 50M messages = schema violation errors
Resolution:
- Immediate: Alert on DLQ size > 1000 messages
- Short-term: DLQ consumer restart + replay for retriable errors
- Medium: Schema validation on the producer side (prevention)
- Long: DLQ auto-classification + auto-retry for transient errors
- Operational: DLQ dashboard, on-call rotation for DLQ review
Post-mortem:
- DLQ without monitoring = no DLQ (false sense of security)
- DLQ retention policy is mandatory
- Schema registry for producer validation
- Regular DLQ review process (daily)
Monitoring (JMX, Prometheus, Burrow)
JMX/Micrometer metrics (illustrative names; the exact metric names depend on how your Spring Kafka metrics are registered):
spring.kafka.consumer.dlq.messages.sent.total
spring.kafka.consumer.dlq.messages.sent.rate
spring.kafka.consumer.retry.attempts.total
spring.kafka.consumer.retry.success.total
Prometheus + Grafana:
- record: kafka_dlq_message_rate
expr: rate(kafka_topic_messages_total{topic=~".*-dlq"}[5m])
- alert: KafkaDLQSizeExceeded
expr: kafka_topic_partition_log_size{topic=~".*-dlq"} > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "DLQ size exceeded 10,000 messages"
- alert: KafkaDLQMessageRateHigh
expr: rate(kafka_dlq_message_total[5m]) > 10
for: 2m
labels:
severity: critical
annotations:
summary: "DLQ ingest rate above 10 messages/second"
- alert: KafkaDLQConsumerDown
expr: kafka_consumer_lag{consumer_group="dlq-handler"} > 1000
for: 5m
labels:
severity: critical
annotations:
summary: "DLQ consumer is falling behind"
- alert: KafkaDLQDiskUsageHigh
expr: kafka_log_size_bytes{topic=~".*-dlq"} / kafka_disk_capacity_bytes > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "DLQ using > 10% disk capacity"
Burrow for DLQ consumer monitoring:
GET /v3/kafka/cluster/consumer/dlq-handler/status
{
"status": "OK",
"lag": {
"orders-dlq": {
"partition_0": {"current_lag": 0},
"partition_1": {"current_lag": 500} // DLQ consumer falling behind
}
}
}
High-Load Best Practices
✅ Separate DLQ topic (don't re-use original topic)
✅ DLQ enrichment with metadata (topic, partition, offset, exception)
✅ DLQ retention policy (1–7 days, not infinite)
✅ Alert on DLQ messages (never silent DLQ)
✅ DLQ consumer for automated retry/replay
✅ DLQ dashboard (message count, error types, age)
✅ Schema validation on the producer side (prevent poison pills)
✅ Per-error-type DLQ routing for automated classification
✅ Idempotent DLQ producer (prevent duplicate DLQ messages)
✅ Circuit breaker on DLQ send rate (prevent DLQ overflow)
❌ No DLQ in production
❌ DLQ without monitoring/alerting
❌ DLQ without retention policy (disk fill)
❌ DLQ = original topic (infinite loop)
❌ No DLQ consumer (messages accumulate forever)
❌ DLQ without metadata enrichment (lost context, impossible to debug)
Architectural Decisions
- DLQ — not just a topic, but a process — producer → DLQ → consumer → replay → fix
- Enrichment is critical — without metadata, DLQ is useless for debugging
- Auto-classification — categorize DLQ errors for automated handling
- Prevention > cure — schema validation, producer-side checks
- DLQ retention — balance between debugging needs and disk usage
Summary for Senior
- DLQ — insurance policy against poison pill and data loss
- DLQ without monitoring = no DLQ (false sense of security)
- Enrichment headers — critical for debugging and replay
- DLQ consumer is needed for automated retry/replay capability
- Per-error routing → automated classification → reduced manual review
- Schema validation on the producer — best way to reduce DLQ volume
- DLQ retention policy is mandatory for disk management
- Transactional DLQ for exactly-once DLQ semantics
- Under high load, the DLQ send rate itself can become a problem — a circuit breaker is needed
- DLQ dashboard + on-call process — operational necessity
🎯 Interview Cheat Sheet
Must know:
- DLQ — a special topic for messages not processed after max retries
- DLQ enrichment: original topic, partition, offset, exception message, stacktrace, timestamp
- Patterns: Single DLQ, Per-topic DLQ (topic-dlq), Per-error DLQ, Multi-level DLQ
- DLQ consumer: reads DLQ, classifies errors, retry or manual review
- DLQ retention policy (1-7 days) — without it, disk fill is guaranteed
- DLQ without monitoring = no DLQ (false sense of security)
- Transactional DLQ producer prevents duplicate DLQ messages
Common follow-up questions:
- Why enrich DLQ messages? — Without metadata, debugging and replay are impossible.
- What if the DLQ producer also fails? — Retry limit → manual-review queue.
- How to replay from DLQ? — DLQ consumer reads, checks the fix, resends to the original topic.
- DLQ = original topic? — NO! Infinite loop: message bounce between topics.
Red flags (DO NOT say):
- “DLQ without monitoring is fine” — messages accumulate forever unnoticed
- “DLQ without retention — store forever” — disk fill guaranteed
- “DLQ = original topic” — infinite retry loop
- “DLQ is not needed for idempotent operations” — poison pills still need it
Related topics:
- [[24. How to handle errors when reading messages]]
- [[26. How to monitor consumer lag]]
- [[11. How to configure exactly-once semantics]]
- [[10. What is the difference between at-most-once, at-least-once and exactly-once]]