Що таке DLQ (Dead Letter Queue) в Kafka

🟢 Junior Level

Просте визначення

DLQ (Dead Letter Queue) — це спеціальний Kafka топик, куди відправляються повідомлення, які не вдалося обробити після кількох спроб. Це “ізолятор” для проблемних даних.

Аналогія

Уявіть фабрику з пакування подарунків. Працівник бере подарунок, намагається упакувати. Якщо подарунок пошкоджений (зламаний, розбитий):

Спробувати ще раз (retry)
Якщо після 3 спроб все ще зламаний — покласти в кошик “брак” (DLQ)
Продовжити пакувати наступні подарунки

Конвеєр не зупиняється, а браковані подарунки можна пізніше вивчити і полагодити.

Приклад з Spring Kafka

@Configuration
public class KafkaConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTemplate<String, String> kafkaTemplate) {

        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(consumerFactory);

        // Налаштування DLQ
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3)  // 3 retry, 1 секунда між спробами
        ));

        return factory;
    }
}

Що потрапляє в DLQ

Corrupt дані (невалідний JSON, schema mismatch)
Неможливо обробити (БД недоступна > max retries)
Business rule violations (negative price, invalid email)
External service errors (API rate limited, timeout)

Коли використовувати

Критичні дані — не можна втрачати повідомлення
Необхідність аналізу — потрібно зрозуміти чому обробка не вдається
Compliance — аудит і traceability
Production — завжди мати DLQ plan

🟡 Middle Level

Архітектура DLQ

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Producer  │────>│  Main Topic  │────>│  Consumer   │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                                    ┌───────────┼───────────┐
                                    │           │           │
                               Success    Error → Retry   │
                                                  │        │
                                          Max retries?    │
                                             Yes    │     │
                                                  ▼     │
                                          ┌──────────────┴──┐
                                          │    DLQ Topic     │
                                          │  (orders-dlq)    │
                                          └──────────────────┘

DLQ Message Enrichment

Просто відправити оригінальне повідомлення в DLQ — недостатньо. Потрібно додати контекст:

public class DLQEnricher {

    public ProducerRecord<String, String> enrichForDLQ(
            ConsumerRecord<String, String> original,
            Exception error) {

        ProducerRecord<String, String> dlqRecord = new ProducerRecord<>(
            original.topic() + "-dlq",  // orders → orders-dlq
            original.key(),
            original.value()
        );

        // Метадані для debugging
        dlqRecord.headers().add("x-original-topic", original.topic().getBytes(StandardCharsets.UTF_8));
        dlqRecord.headers().add("x-original-partition", String.valueOf(original.partition()).getBytes());
        dlqRecord.headers().add("x-original-offset", String.valueOf(original.offset()).getBytes());
        dlqRecord.headers().add("x-original-timestamp", String.valueOf(original.timestamp()).getBytes());
        dlqRecord.headers().add("x-consumer-group", "order-service".getBytes());
        dlqRecord.headers().add("x-error-message", error.getMessage().getBytes(StandardCharsets.UTF_8));
        dlqRecord.headers().add("x-error-stacktrace", getStackTrace(error).getBytes(StandardCharsets.UTF_8));
        dlqRecord.headers().add("x-retry-count", "3".getBytes());
        dlqRecord.headers().add("x-dlq-timestamp", String.valueOf(System.currentTimeMillis()).getBytes());

        return dlqRecord;
    }
}

DLQ Patterns

Паттерн	Опис	Коли використовувати
Single DLQ	Один DLQ топик для всіх помилок	Прості системи
Per-topic DLQ	`topic-dlq` для кожного topic	Різні команди відповідальності
Per-error DLQ	`topic-dlq-validation`, `topic-dlq-timeout`	Тонкий analysis
Multi-level DLQ	DLQ → DLQ2 → Manual review	High-compliance системи

DLQ Consumer — обробка DLQ

// DLQ Consumer — читає DLQ і вирішує що робити
@KafkaListener(topics = "orders-dlq", groupId = "dlq-handler")
public void handleDLQ(ConsumerRecord<String, String> record) {
    String originalTopic = getHeader(record, "x-original-topic");
    String errorMessage = getHeader(record, "x-error-message");
    String stacktrace = getHeader(record, "x-error-stacktrace");

    // Класифікація помилки
    if (errorMessage.contains("Timeout")) {
        // Тимчасова помилка — retry
        retryMessage(record);
    } else if (errorMessage.contains("ConstraintViolation")) {
        // Data issue — manual review
        sendToManualReview(record);
    } else if (errorMessage.contains("Schema")) {
        // Schema mismatch — fix producer
        alertProducerTeam(record);
    } else {
        // Unknown — manual review
        sendToManualReview(record);
    }
}

Таблиця типових помилок

Помилка	Симптоми	Наслідки	Рішення
Без DLQ	Poison pill блокує consumer	Consumer lag ∞, pipeline stall	Додати DLQ
DLQ без monitoring	DLQ росте до нескінченності	Disk fill, missed errors	Alert на DLQ size
DLQ без metadata	Неможливо debug	Години на investigation	Enrich headers
DLQ = оригінальний topic	Infinite loop	Message bounce між topics	Окремий DLQ topic
DLQ без retention	DLQ займає диск вічно	Disk fill	Налаштувати retention
Без DLQ consumer	DLQ messages накопичуються	Без автоматичного recovery	DLQ consumer для auto-retry

Коли НЕ використовувати DLQ

Non-recoverable дані — логи, метрики які можна втратити
Ідемпотентні операції — retry безпечно повторювати нескінченно
Тестові оточення — можна просто перезапустити
Дуже low-volume (< 10 msg/min) — простіше alert і manual fix
Для повністю ідемпотентних операцій з автоматичним retry DLQ може бути зайвим — достатньо alerting на retry exhaustion.

🔴 Senior Level

Глибокі внутрішності

DLQ в Spring Kafka — DeadLetterPublishingRecoverer

// Internal implementation
public class DeadLetterPublishingRecoverer implements ConsumerRecordRecoverer {

    private final Function<ConsumerRecord<?, ?>, TopicPartition> destinationResolver;
    private final KafkaTemplate<Object, Object> template;

    @Override
    public void accept(ConsumerRecord<?, ?> record, Exception exception) {
        // 1. Determine DLQ topic
        TopicPartition dest = destinationResolver.apply(record);

        // 2. Create enriched record
        ProducerRecord<Object, Object> dlqRecord = createProducerRecord(record, exception);

        // 3. Send to DLQ (sync to guarantee delivery)
        template.send(dlqRecord).get();  // Blocking!

        // 4. Log
        log.error("Sent to DLQ: topic={}, partition={}, offset={}",
                  record.topic(), record.partition(), record.offset());
    }

    // Header enrichment
    private ProducerRecord<Object, Object> createProducerRecord(
            ConsumerRecord<?, ?> record, Exception exception) {
        // Копіює key, value
        // Додає headers:
        //   - KafkaHeaders.DLT_EXCEPTION_FQCN (exception class name)
        //   - KafkaHeaders.DLT_EXCEPTION_MESSAGE (exception message)
        //   - KafkaHeaders.DLT_ORIGINAL_TOPIC
        //   - KafkaHeaders.DLT_ORIGINAL_PARTITION
        //   - KafkaHeaders.DLT_ORIGINAL_OFFSET
        //   - KafkaHeaders.DLT_ORIGINAL_TIMESTAMP
    }
}

Kafka Streams — Branching для DLQ

// Kafka Streams не має built-in DLQ
// Реалізуємо через branching

KStream<String, String> orders = builder.stream("orders");

// Split: valid vs invalid
KStream<String, String>[] branches = orders.branch(
    (key, value) -> isValid(value),   // valid stream
    (key, value) -> !isValid(value)   // invalid → DLQ
);

// Valid → process
branches[0].process(() -> orderProcessor);

// Invalid → DLQ topic
branches[1].mapValues(value -> enrichWithMetadata(value, "validation-error"))
            .to("orders-dlq");

Apache Camel / Kafka Connect DLQ

# Kafka Connect — built-in DLQ support
errors.tolerance=all
errors.deadletterqueue.topic.name=connect-dlq
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true

# Connect автоматично відправляє failed records в DLQ
# з context headers

Trade-offs

Аспект	Без DLQ	З DLQ
Data safety	Низька (poison pill data loss)	Висока
System availability	Низька (stall on error)	Висока
Complexity	Низька	Середня
Debugging time	Години-дні	Хвилини
Storage overhead	0%	0.1–1% (DLQ messages)
Operational overhead	High (manual investigation)	Low (automated)

Edge Cases

1. DLQ overflow — коли DLQ сам стає проблемою:

Сценарій:
  Producer bug → всі messages invalid
  100K messages/min → DLQ
  DLQ topic: replication factor 3, retention 7 днів
  DLQ storage: 100K × 60 × 24 × 7 × 1KB = 10GB/day

Проблема: DLQ fills disk faster than main topic!

Рішення:
  1. DLQ retention.ms = 86400000 (1 день, не 7)
  2. DLQ consumer для швидкого drain
  3. Alert on DLQ rate > threshold
  4. Circuit breaker: stop sending to DLQ, alert only

2. DLQ ordering і causality:

Сценарій:
  Partition 0: order-1 (success), order-2 (DLQ), order-3 (success)
  order-3 залежить від order-2 (update після create)
  order-2 в DLQ → order-3 processed but fails (no order-2)

Проблема: Causality broken — downstream processing залежить від DLQ'd message

Рішення:
  1. Saga pattern: order-3 checks prerequisite
  2. Compensating action: якщо order-2 в DLQ → rollback order-1
  3. Partition by correlation key — related messages в same partition

3. DLQ replay — перевідправлення з DLQ:

// DLQ replay consumer
@KafkaListener(topics = "orders-dlq", groupId = "dlq-replayer")
public void replay(ConsumerRecord<String, String> record) {
    String errorMessage = getHeader(record, "x-error-message");
    String originalTopic = getHeader(record, "x-original-topic");

    if (errorMessage.contains("Timeout") && isDatabaseHealthy()) {
        // Retry — помилка була тимчасовою
        originalProducer.send(new ProducerRecord<>(
            originalTopic, record.key(), record.value()
        ));
        log.info("Replayed from DLQ: {}", record.offset());
    } else {
        // Не retryable — manual review
        log.warn("Non-replayable DLQ message: {}", record.offset());
    }
}
// При повторній помилці — відправити в manual-review queue, НЕ нескінченний retry.
// DLQ-for-DLQ pattern: після N retry → друга DLQ → manual intervention.

4. DLQ і exactly-once semantics:

Проблема:
  Consumer читає message → error → send to DLQ → commit offset
  Consumer crash AFTER DLQ send, BEFORE commit
  При restart: message прочитан знову → знову в DLQ → duplicate in DLQ

Рішення:
  1. Transactional DLQ producer:
     transaction {
       process(record);
       if (error) dlqProducer.send(dlqRecord);
       consumer.commitSync();
     }
  2. DLQ deduplication key = original topic + partition + offset
  3. Idempotent DLQ producer

Highload Best Practices

✅ Окремий DLQ topic (не re-use original topic)
✅ DLQ enrichment з metadata (topic, partition, offset, exception)
✅ DLQ retention policy (1–7 днів, не infinite)
✅ Alert on DLQ messages (never silent DLQ)
✅ DLQ consumer для automated retry/replay
✅ DLQ dashboard (message count, error types, age)
✅ Schema validation на producer side (prevent poison pills)
✅ Per-error-type DLQ routing для automated classification
✅ Idempotent DLQ producer (prevent duplicate DLQ messages)
✅ Circuit breaker на DLQ send rate (prevent DLQ overflow)

❌ Без DLQ в production
❌ DLQ без monitoring/alerting
❌ DLQ без retention policy (disk fill)
❌ DLQ = оригінальний topic (infinite loop)
❌ Без DLQ consumer (messages accumulate forever)
❌ Без metadata в DLQ (impossible to debug)
❌ DLQ без enrichment (lost context)

Архітектурні рішення

DLQ — не просто топик, а процес — producer → DLQ → consumer → replay → fix
Enrichment critical — без metadata DLQ марний для debugging
Auto-classification — categorize DLQ errors для automated handling
Prevention > cure — schema validation, producer-side checks
DLQ retention — balance між debugging needs і disk usage

Резюме для Senior

DLQ — insurance policy проти poison pill і data loss
DLQ без monitoring = no DLQ (false sense of security)
Enrichment headers — критичні для debugging і replay
DLQ consumer потрібен для automated retry/replay capability
Per-error routing → automated classification → reduced manual review
Schema validation on producer — найкращий спосіб зменшити DLQ volume
DLQ retention policy обов’язкова для disk management
Transactional DLQ для exactly-once DLQ semantics
At highload: DLQ rate може стати проблемою — circuit breaker потрібен
DLQ dashboard + on-call process — operational necessity

🎯 Шпаргалка для інтерв’ю

Обов’язково знати:

DLQ — спеціальний топик для повідомлень, не оброблених після max retries
DLQ enrichment: original topic, partition, offset, exception message, stacktrace, timestamp
Паттерни: Single DLQ, Per-topic DLQ (topic-dlq), Per-error DLQ, Multi-level DLQ
DLQ consumer: читає DLQ, класифікує помилки, retry або manual review
DLQ retention policy (1-7 днів) — без неї disk fill guaranteed
DLQ без monitoring = no DLQ (false sense of security)
Transactional DLQ producer запобігає duplicate DLQ messages

Часті уточнюючі запитання:

Навіщо enrich DLQ messages? — Без metadata неможливі debugging і replay.
Що якщо DLQ producer тоже впаде? — Retry limit → manual-review queue.
Як replay з DLQ? — DLQ consumer читає, перевіряє fix, resends в original topic.
DLQ = original topic? — НІ! Infinite loop: message bounce між topics.

Червоні прапорці (НЕ говорити):

«DLQ без monitoring — нормально» — messages accumulate forever unnoticed
«DLQ без retention — зберігати вічно» — disk fill guaranteed
«DLQ = оригінальний топик» — infinite retry loop
«DLQ не потрібен для ідемпотентних операцій» — poison pill все одно потрібен

Пов’язані теми:

[[24. Як обробляти помилки при читанні повідомлень]]
[[26. Як моніторити lag консьюмера]]
[[11. Як налаштувати exactly-once семантику]]
[[10. У чому різниця між at-most-once, at-least-once і exactly-once]]