Question 25 · Section 15

What is a DLQ (Dead Letter Queue) in Kafka?

Junior Level

Simple Definition

DLQ (Dead Letter Queue) — a special Kafka topic where messages that could not be processed after several attempts are sent. It is a "quarantine" for problematic data.

Analogy

Imagine a gift-wrapping factory. A worker picks up a gift and tries to wrap it. If the gift is damaged (broken, cracked):

  1. Try again (retry)
  2. If after 3 attempts it is still broken — put it in the “defective” basket (DLQ)
  3. Continue wrapping the next gifts

The conveyor doesn’t stop, and the defective gifts can be examined and fixed later.
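
The retry-then-basket flow can be sketched in plain Java, with no Kafka involved. All names here (ConveyorSketch, wrap, the "broken" prefix) are illustrative, not a real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the retry-then-DLQ flow from the analogy.
public class ConveyorSketch {
    static final int MAX_ATTEMPTS = 3;
    static final List<String> dlq = new ArrayList<>();   // the "defective" basket

    static void handle(String gift, Consumer<String> wrap) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                wrap.accept(gift);                       // try to wrap
                return;                                  // success — next gift
            } catch (RuntimeException e) {
                // swallow and retry until attempts are exhausted
            }
        }
        dlq.add(gift);                                   // give up — to the basket
    }

    public static void main(String[] args) {
        Consumer<String> wrap = g -> {
            if (g.startsWith("broken")) throw new IllegalStateException("damaged");
        };
        for (String g : List.of("gift-1", "broken-2", "gift-3")) handle(g, wrap);
        System.out.println(dlq);                         // [broken-2]
    }
}
```

The key property: a failing item never blocks the loop, which is exactly what a DLQ buys a Kafka consumer.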

Example with Spring Kafka

@Configuration
public class KafkaConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTemplate<String, String> kafkaTemplate) {

        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(consumerFactory);

        // DLQ configuration
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3)  // 3 retries, 1 second between attempts
        ));

        return factory;
    }
}

What Goes to DLQ

  • Corrupt data (invalid JSON, schema mismatch)
  • Processing failures that outlast retries (e.g. database still unavailable after max retries)
  • Business rule violations (negative price, invalid email)
  • External service errors (API rate limited, timeout)
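
These categories can be folded into a classification step that decides whether retrying is worth it before a message goes to the DLQ. The mapping below is a simplified sketch; the exception-to-category rules are an assumption:

```java
import java.util.concurrent.TimeoutException;

// Hypothetical classification: transient failures get retries first;
// corrupt data / rule violations go straight to the DLQ (retrying cannot fix them).
public class ErrorClassifier {
    enum Action { RETRY_THEN_DLQ, DLQ_IMMEDIATELY }

    static Action classify(Exception e) {
        // Transient infrastructure errors: retry first, DLQ only if exhausted
        if (e instanceof TimeoutException) return Action.RETRY_THEN_DLQ;
        // Corrupt payloads / business-rule violations: no retry will help
        if (e instanceof IllegalArgumentException) return Action.DLQ_IMMEDIATELY;
        return Action.RETRY_THEN_DLQ;   // unknown: assume transient
    }

    public static void main(String[] args) {
        System.out.println(classify(new TimeoutException("db timeout")));        // RETRY_THEN_DLQ
        System.out.println(classify(new IllegalArgumentException("bad price"))); // DLQ_IMMEDIATELY
    }
}
```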

When to Use

  • Critical data — messages cannot be lost
  • Need for analysis — understanding why processing failed
  • Compliance — audit and traceability
  • Production — always have a DLQ plan

Middle Level

DLQ Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Producer  │────>│  Main Topic  │────>│  Consumer   │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │ process
                                   Success ◄────┤
                                                │ Error
                                                ▼
                                        retry (backoff)
                                                │ max retries exhausted
                                                ▼
                                       ┌──────────────────┐
                                       │     DLQ Topic    │
                                       │   (orders-dlq)   │
                                       └──────────────────┘

DLQ Message Enrichment

Simply sending the original message to the DLQ is not enough. Context must be added:

public class DLQEnricher {

    public ProducerRecord<String, String> enrichForDLQ(
            ConsumerRecord<String, String> original,
            Exception error) {

        ProducerRecord<String, String> dlqRecord = new ProducerRecord<>(
            original.topic() + "-dlq",  // orders → orders-dlq
            original.key(),
            original.value()
        );

        // Metadata for debugging (encode every header as UTF-8)
        var utf8 = StandardCharsets.UTF_8;
        dlqRecord.headers().add("x-original-topic", original.topic().getBytes(utf8));
        dlqRecord.headers().add("x-original-partition", String.valueOf(original.partition()).getBytes(utf8));
        dlqRecord.headers().add("x-original-offset", String.valueOf(original.offset()).getBytes(utf8));
        dlqRecord.headers().add("x-original-timestamp", String.valueOf(original.timestamp()).getBytes(utf8));
        dlqRecord.headers().add("x-consumer-group", "order-service".getBytes(utf8));
        dlqRecord.headers().add("x-error-message", String.valueOf(error.getMessage()).getBytes(utf8));  // getMessage() may be null
        dlqRecord.headers().add("x-error-stacktrace", getStackTrace(error).getBytes(utf8));
        dlqRecord.headers().add("x-retry-count", "3".getBytes(utf8));
        dlqRecord.headers().add("x-dlq-timestamp", String.valueOf(System.currentTimeMillis()).getBytes(utf8));

        return dlqRecord;
    }
}

DLQ Patterns

Pattern            Description                               When to Use
Single DLQ         One DLQ topic for all errors              Simple systems
Per-topic DLQ      topic-dlq for each topic                  Different team responsibilities
Per-error DLQ      topic-dlq-validation, topic-dlq-timeout   Detailed analysis
Multi-level DLQ    DLQ → DLQ2 → Manual review                High-compliance systems
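
A per-error routing rule like the topic-dlq-validation example above can be expressed as a small resolver. The naming scheme and exception mapping here are assumptions, mirroring the pattern names:

```java
import java.util.concurrent.TimeoutException;

// Sketch of a per-error DLQ topic resolver following a
// "<topic>-dlq-<category>" naming convention.
public class DlqTopicResolver {
    static String resolve(String sourceTopic, Exception error) {
        if (error instanceof IllegalArgumentException) return sourceTopic + "-dlq-validation";
        if (error instanceof TimeoutException)         return sourceTopic + "-dlq-timeout";
        return sourceTopic + "-dlq";   // fallback: plain per-topic DLQ
    }

    public static void main(String[] args) {
        System.out.println(resolve("orders", new IllegalArgumentException("bad JSON")));
        // orders-dlq-validation
    }
}
```

With this in place, a DLQ consumer per category can apply a category-specific recovery strategy instead of re-classifying from error messages.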

DLQ Consumer — Processing the DLQ

// DLQ Consumer — reads the DLQ and decides what to do
@KafkaListener(topics = "orders-dlq", groupId = "dlq-handler")
public void handleDLQ(ConsumerRecord<String, String> record) {
    String originalTopic = getHeader(record, "x-original-topic");
    String errorMessage = getHeader(record, "x-error-message");
    String stacktrace = getHeader(record, "x-error-stacktrace");

    // Error classification
    if (errorMessage.contains("Timeout")) {
        // Temporary error — retry
        retryMessage(record);
    } else if (errorMessage.contains("ConstraintViolation")) {
        // Data issue — manual review
        sendToManualReview(record);
    } else if (errorMessage.contains("Schema")) {
        // Schema mismatch — fix producer
        alertProducerTeam(record);
    } else {
        // Unknown — manual review
        sendToManualReview(record);
    }
}
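
The getHeader helper used above is not shown anywhere; a plain-Java sketch of the idea follows, with a Map standing in for Kafka's Headers type (which also stores byte[] values):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of the getHeader helper: decode a UTF-8 header value, null if absent.
public class HeaderReader {
    static String getHeader(Map<String, byte[]> headers, String key) {
        byte[] value = headers.get(key);
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, byte[]> headers = Map.of(
            "x-error-message", "Timeout calling payment API".getBytes(StandardCharsets.UTF_8)
        );
        System.out.println(getHeader(headers, "x-error-message"));  // Timeout calling payment API
        System.out.println(getHeader(headers, "x-missing"));        // null
    }
}
```

The null-handling matters: a missing header should route the message to manual review, not crash the DLQ consumer itself.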

Common Errors Table

Error                    Symptoms                       Consequences                      Solution
Without DLQ              Poison pill blocks consumer    Consumer lag ∞, pipeline stall    Add DLQ
DLQ without monitoring   DLQ grows indefinitely         Disk fill, missed errors          Alert on DLQ size
DLQ without metadata     Impossible to debug            Hours spent on investigation      Enrich headers
DLQ = original topic     Infinite loop                  Message bounce between topics     Separate DLQ topic
DLQ without retention    DLQ occupies disk forever      Disk fill                         Configure retention
Without DLQ consumer     DLQ messages accumulate        No automatic recovery             DLQ consumer for auto-retry

When NOT to Use DLQ

  • Non-recoverable data — logs, metrics that can afford to be lost
  • Test environments — failed runs can simply be restarted
  • Very low volume (< 10 msg/min) — simpler to alert and fix manually
  • Fully idempotent operations with automatic retry — a DLQ may be excessive; alerting on retry exhaustion can suffice (though a poison pill that never succeeds still needs somewhere to go)

Senior Level

Deep Internals

DLQ in Spring Kafka — DeadLetterPublishingRecoverer

// Simplified sketch of the internal implementation (not the verbatim Spring source)
public class DeadLetterPublishingRecoverer implements ConsumerRecordRecoverer {

    // The real resolver takes the record AND the exception
    private final BiFunction<ConsumerRecord<?, ?>, Exception, TopicPartition> destinationResolver;
    private final KafkaTemplate<Object, Object> template;

    @Override
    public void accept(ConsumerRecord<?, ?> record, Exception exception) {
        // 1. Determine DLQ topic (and partition)
        TopicPartition dest = destinationResolver.apply(record, exception);

        // 2. Create enriched record
        ProducerRecord<Object, Object> dlqRecord = createProducerRecord(record, exception);

        // 3. Send to DLQ (blocks so the failure is persisted before the offset commits)
        template.send(dlqRecord).get();  // Blocking!

        // 4. Log
        log.error("Sent to DLQ: topic={}, partition={}, offset={}",
                  record.topic(), record.partition(), record.offset());
    }

    // Header enrichment
    private ProducerRecord<Object, Object> createProducerRecord(
            ConsumerRecord<?, ?> record, Exception exception) {
        // Copies key, value
        // Adds headers:
        //   - KafkaHeaders.DLT_EXCEPTION_FQCN (exception class name)
        //   - KafkaHeaders.DLT_EXCEPTION_MESSAGE (exception message)
        //   - KafkaHeaders.DLT_ORIGINAL_TOPIC
        //   - KafkaHeaders.DLT_ORIGINAL_PARTITION
        //   - KafkaHeaders.DLT_ORIGINAL_OFFSET
        //   - KafkaHeaders.DLT_ORIGINAL_TIMESTAMP
    }
}

Kafka Streams — Branching for DLQ

// Kafka Streams has no built-in DLQ — implement it via branching.
// Note: branch() was deprecated in Kafka Streams 2.8 in favor of split();
// it is shown here because the array form is more compact.

KStream<String, String> orders = builder.stream("orders");

// Split: valid vs invalid
KStream<String, String>[] branches = orders.branch(
    (key, value) -> isValid(value),   // valid stream
    (key, value) -> !isValid(value)   // invalid → DLQ
);

// Valid → process (the supplier must return a NEW Processor per task;
// OrderProcessor::new is illustrative — sharing one instance is unsafe)
branches[0].process(OrderProcessor::new);

// Invalid → DLQ topic
branches[1].mapValues(value -> enrichWithMetadata(value, "validation-error"))
            .to("orders-dlq");

Apache Camel / Kafka Connect DLQ

# Kafka Connect — built-in DLQ support
errors.tolerance=all
errors.deadletterqueue.topic.name=connect-dlq
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true

# Connect automatically sends failed records to DLQ
# with context headers

Trade-offs

Aspect                 Without DLQ                    With DLQ
Data safety            Low (poison pill data loss)    High
System availability    Low (stall on error)           High
Complexity             Low                            Medium
Debugging time         Hours to days                  Minutes
Storage overhead       0%                             0.1–1% (DLQ messages)
Operational overhead   High (manual investigation)    Low (automated)

Edge Cases

1. DLQ overflow — when the DLQ itself becomes a problem:

Scenario:
  Producer bug → all messages are invalid
  100K messages/min → DLQ
  DLQ topic: replication factor 3, retention 7 days
  DLQ inflow: 100K msg/min × 1,440 min/day × 1KB ≈ 144GB/day per replica
              (≈ 432GB/day with RF 3; ≈ 1TB per replica over the 7-day retention)

Problem: DLQ fills disk faster than the main topic!

Resolution:
  1. DLQ retention.ms = 86400000 (1 day, not 7)
  2. DLQ consumer for fast drain
  3. Alert on DLQ rate > threshold
  4. Circuit breaker: stop sending to DLQ, alert only
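
Step 4 can be sketched as a sliding-window rate check on DLQ sends. Thresholds, window size, and class names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a rate-based circuit breaker for DLQ sends: when the DLQ
// receives more than maxPerWindow messages inside windowMs, stop
// writing to the DLQ and alert instead (the breaker "opens").
public class DlqCircuitBreaker {
    private final Deque<Long> sendTimes = new ArrayDeque<>();
    private final int maxPerWindow;
    private final long windowMs;

    DlqCircuitBreaker(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    // Returns true if the send is allowed; false means the breaker is open.
    synchronized boolean trySend(long nowMs) {
        while (!sendTimes.isEmpty() && nowMs - sendTimes.peekFirst() > windowMs) {
            sendTimes.pollFirst();   // drop events that fell out of the window
        }
        if (sendTimes.size() >= maxPerWindow) return false;
        sendTimes.addLast(nowMs);
        return true;
    }

    public static void main(String[] args) {
        DlqCircuitBreaker breaker = new DlqCircuitBreaker(3, 60_000);
        System.out.println(breaker.trySend(0));        // true
        System.out.println(breaker.trySend(1));        // true
        System.out.println(breaker.trySend(2));        // true
        System.out.println(breaker.trySend(3));        // false — breaker open
        System.out.println(breaker.trySend(61_003));   // true — window rolled over
    }
}
```

A burst caused by a producer bug then triggers one alert instead of flooding the DLQ topic with millions of identical messages.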

2. DLQ ordering and causality:

Scenario:
  Partition 0: order-1 (success), order-2 (DLQ), order-3 (success)
  order-3 depends on order-2 (update after create)
  order-2 in DLQ → order-3 processed but fails (no order-2)

Problem: Causality broken — downstream processing depends on DLQ'd message

Resolution:
  1. Saga pattern: order-3 checks prerequisite
  2. Compensating action: if order-2 is in DLQ → rollback order-1
  3. Partition by correlation key — related messages in the same partition
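
Resolution 3 rests on a simple invariant: the same correlation key always maps to the same partition. The sketch below uses a plain hash; Kafka's default partitioner actually uses murmur2 over the serialized key, but the invariant is identical:

```java
// Simplified model of key-based partitioning: related messages
// (create + update for the same order) land in the same partition,
// so their relative order is preserved.
public class CorrelationPartitioner {
    static int partitionFor(String correlationKey, int numPartitions) {
        return Math.floorMod(correlationKey.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int create = partitionFor("order-42", 6);   // create event
        int update = partitionFor("order-42", 6);   // update event, same key
        System.out.println(create == update);       // true — causality preserved
    }
}
```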

3. DLQ replay — resending from DLQ:

// DLQ replay consumer
@KafkaListener(topics = "orders-dlq", groupId = "dlq-replayer")
public void replay(ConsumerRecord<String, String> record) {
    String errorMessage = getHeader(record, "x-error-message");
    String originalTopic = getHeader(record, "x-original-topic");

    if (errorMessage.contains("Timeout") && isDatabaseHealthy()) {
        // Retry — the error was temporary
        originalProducer.send(new ProducerRecord<>(
            originalTopic, record.key(), record.value()
        ));
        log.info("Replayed from DLQ: {}", record.offset());
    } else {
        // Not retryable — manual review
        log.warn("Non-replayable DLQ message: {}", record.offset());
    }
}
// On repeated error — send to manual-review queue, NOT infinite retry.
// DLQ-for-DLQ pattern: after N retries → a second DLQ → manual intervention.

4. DLQ and Exactly-Once Semantics:

Problem:
  Consumer reads message → error → send to DLQ → commit offset
  Consumer crashes AFTER DLQ send, BEFORE commit
  On restart: message read again → sent to DLQ again → duplicate in DLQ

Resolution:
  1. Transactional DLQ producer:
     transaction {
       process(record);
       if (error) dlqProducer.send(dlqRecord);
       consumer.commitSync();
     }
  2. DLQ deduplication key = original topic + partition + offset
  3. Idempotent DLQ producer
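
Resolution 2 can be sketched with a dedup key of original topic + partition + offset; since Kafka offsets are unique per partition, a crash-and-reread cannot produce a second DLQ copy. The in-memory Set is illustrative; production would need a persistent store:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of DLQ deduplication keyed by (topic, partition, offset).
public class DlqDeduplicator {
    private final Set<String> seen = new HashSet<>();   // production: compacted topic or DB

    boolean shouldSend(String topic, int partition, long offset) {
        // Set.add returns false if the key was already present
        return seen.add(topic + "-" + partition + "-" + offset);
    }

    public static void main(String[] args) {
        DlqDeduplicator dedup = new DlqDeduplicator();
        System.out.println(dedup.shouldSend("orders", 0, 42));  // true — first attempt
        System.out.println(dedup.shouldSend("orders", 0, 42));  // false — redelivery after crash
    }
}
```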

5. Multi-tenant DLQ Isolation:

Scenario:
  SaaS platform, 100 tenants, shared Kafka topics
  Tenant A errors → DLQ
  Tenant B errors → same DLQ
  Tenant C sees DLQ messages from A and B (data leak!)

Resolution:
  Per-tenant DLQ topics:
    orders-tenant-a-dlq
    orders-tenant-b-dlq
  Or DLQ with tenant-id header + ACLs

Performance (Production Numbers)

DLQ Pattern         Overhead        Throughput Impact   Latency Impact   Storage
No DLQ              0%              0%                  0ms              0
Sync DLQ send       +10–30ms/msg    -5%                 +20ms            0.1–1%
Async DLQ send      +1ms            -1%                 +2ms             0.1–1%
DLQ + enrichment    +15–40ms/msg    -8%                 +30ms            0.5–2%
Transactional DLQ   +50ms           -15%                +50ms            0.1–1%

Production War Story

Situation: FinTech, transaction processing. A DLQ existed but was not monitored.

Problem: Discovered that the DLQ topic grew to 50M messages over 3 months. 50M transactions of $10–500 each were not processed. Potential financial exposure: $5M+.

Investigation:

  • DLQ consumer existed but crashed 3 months ago
  • Nobody noticed (no alerting)
  • DLQ messages accumulated without processing
  • Retention = -1 (infinite) → disk fill risk
  • Root cause: producer schema change, validation messages in DLQ
  • 50M messages = schema violation errors

Resolution:

  1. Immediate: Alert on DLQ size > 1000 messages
  2. Short-term: DLQ consumer restart + replay for retriable errors
  3. Medium: Schema validation on the producer side (prevention)
  4. Long: DLQ auto-classification + auto-retry for transient errors
  5. Operational: DLQ dashboard, on-call rotation for DLQ review

Post-mortem:

  • DLQ without monitoring = no DLQ (false sense of security)
  • DLQ retention policy is mandatory
  • Schema registry for producer validation
  • Regular DLQ review process (daily)

Monitoring (JMX, Prometheus, Burrow)

Application metrics (illustrative names — Spring Kafka does not publish these out of the box; register equivalent counters yourself via Micrometer):

spring.kafka.consumer.dlq.messages.sent.total
spring.kafka.consumer.dlq.messages.sent.rate
spring.kafka.consumer.retry.attempts.total
spring.kafka.consumer.retry.success.total

Prometheus + Grafana:

# Metric names are illustrative — adapt to your exporter (kafka_exporter, JMX exporter)
- record: kafka_dlq_message_rate
  expr: rate(kafka_topic_messages_total{topic=~".*-dlq"}[5m])

- alert: KafkaDLQSizeExceeded
  expr: kafka_topic_partition_log_size{topic=~".*-dlq"} > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DLQ size exceeded 10,000 messages"

- alert: KafkaDLQMessageRateHigh
  expr: rate(kafka_dlq_message_total[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "More than 10 messages/second going to DLQ"

- alert: KafkaDLQConsumerDown
  expr: kafka_consumer_lag{consumer_group="dlq-handler"} > 1000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "DLQ consumer is falling behind"

- alert: KafkaDLQDiskUsageHigh
  expr: kafka_log_size_bytes{topic=~".*-dlq"} / kafka_disk_capacity_bytes > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "DLQ using > 10% disk capacity"

Burrow for DLQ consumer monitoring:

GET /v3/kafka/cluster/consumer/dlq-handler/status

{
  "status": "OK",
  "lag": {
    "orders-dlq": {
      "partition_0": {"current_lag": 0},
      "partition_1": {"current_lag": 500}  // DLQ consumer falling behind
    }
  }
}

Highload Best Practices

✅ Separate DLQ topic (don't re-use original topic)
✅ DLQ enrichment with metadata (topic, partition, offset, exception)
✅ DLQ retention policy (1–7 days, not infinite)
✅ Alert on DLQ messages (never silent DLQ)
✅ DLQ consumer for automated retry/replay
✅ DLQ dashboard (message count, error types, age)
✅ Schema validation on the producer side (prevent poison pills)
✅ Per-error-type DLQ routing for automated classification
✅ Idempotent DLQ producer (prevent duplicate DLQ messages)
✅ Circuit breaker on DLQ send rate (prevent DLQ overflow)

❌ Without DLQ in production
❌ DLQ without monitoring/alerting
❌ DLQ without retention policy (disk fill)
❌ DLQ = original topic (infinite loop)
❌ Without DLQ consumer (messages accumulate forever)
❌ Without metadata in DLQ (impossible to debug)
❌ DLQ without enrichment (lost context)

Architectural Decisions

  1. DLQ — not just a topic, but a process — producer → DLQ → consumer → replay → fix
  2. Enrichment is critical — without metadata, DLQ is useless for debugging
  3. Auto-classification — categorize DLQ errors for automated handling
  4. Prevention > cure — schema validation, producer-side checks
  5. DLQ retention — balance between debugging needs and disk usage

Summary for Senior

  • DLQ — insurance policy against poison pill and data loss
  • DLQ without monitoring = no DLQ (false sense of security)
  • Enrichment headers — critical for debugging and replay
  • DLQ consumer is needed for automated retry/replay capability
  • Per-error routing → automated classification → reduced manual review
  • Schema validation on the producer — best way to reduce DLQ volume
  • DLQ retention policy is mandatory for disk management
  • Transactional DLQ for exactly-once DLQ semantics
  • At highload: DLQ rate can become a problem — circuit breaker is needed
  • DLQ dashboard + on-call process — operational necessity

🎯 Interview Cheat Sheet

Must know:

  • DLQ — a special topic for messages not processed after max retries
  • DLQ enrichment: original topic, partition, offset, exception message, stacktrace, timestamp
  • Patterns: Single DLQ, Per-topic DLQ (topic-dlq), Per-error DLQ, Multi-level DLQ
  • DLQ consumer: reads DLQ, classifies errors, retry or manual review
  • DLQ retention policy (1-7 days) — without it, disk fill is guaranteed
  • DLQ without monitoring = no DLQ (false sense of security)
  • Transactional DLQ producer prevents duplicate DLQ messages

Common follow-up questions:

  • Why enrich DLQ messages? — Without metadata, debugging and replay are impossible.
  • What if the DLQ producer also fails? — Retry limit → manual-review queue.
  • How to replay from DLQ? — DLQ consumer reads, checks the fix, resends to the original topic.
  • DLQ = original topic? — NO! Infinite loop: message bounce between topics.

Red flags (DO NOT say):

  • “DLQ without monitoring is fine” — messages accumulate forever unnoticed
  • “DLQ without retention — store forever” — disk fill guaranteed
  • “DLQ = original topic” — infinite retry loop
  • “DLQ is not needed for idempotent operations” — poison pills still need it

Related topics:

  • [[24. How to handle errors when reading messages]]
  • [[26. How to monitor consumer lag]]
  • [[11. How to configure exactly-once semantics]]
  • [[10. What is the difference between at-most-once, at-least-once and exactly-once]]