April 18, 2026
Handling Failed Kafka Events with Retry Topic and DLQ in Spring Boot
In the previous article, we understood why consumers should be idempotent when the same event is delivered more than once.
But there is another important problem in event-driven systems.
Sometimes the consumer receives an event, but it cannot process it successfully.
This can happen because of:
- temporary database failure
- downstream service timeout
- network issue
- invalid event data
- a bug in business logic
If the consumer simply fails again and again, the same message may keep blocking the processing flow.
If we ignore the failure, we may lose an important event.
So how do we handle failed events safely?
This is exactly the problem that the Retry Topic + DLQ Pattern helps us solve.
In this article, we will understand:
- what problem happens when Kafka event processing fails
- how retry topic helps us retry failed events
- how DLQ helps us store permanently failed events
- how to implement this flow using Spring Boot and Kafka
The Problem
Let us take a simple example.
We have an order event:
{
"orderId": "order-1001",
"customerId": "customer-42",
"amount": 499.99
}
This event is published to Kafka.
The consumer receives the event and tries to process it.
But suppose processing fails because the database is temporarily down.
Now we have a question:
What should happen to this failed event?
We cannot simply ignore it.
At the same time, retrying forever is also not a good idea.
There is one more important problem.
If the consumer keeps failing and the offset is not committed, Kafka can keep delivering the same message again.
For that partition, messages behind this failed message may remain pending.
So one bad event can delay other valid business events.
Example:
- order-1001 has bad data and keeps failing
- order-1002 and order-1003 are waiting behind it
- downstream processing is delayed
- business flow becomes stuck for that partition
This kind of continuously failing message is often called a poison message.
What Can Go Wrong?
If failed events are not handled carefully, we can face multiple problems.
Case 1: Event is lost
If the consumer catches the exception and does nothing, the event may be lost.
That means the order event was received, but the business action never happened.
Case 2: Event keeps failing forever
If the same event is retried again and again without any limit, it can keep failing forever.
This can waste resources and make debugging harder.
Case 3: One bad event blocks other events
If one event has bad data and keeps failing, it may slow down or block processing for other valid events in the same partition.
That means one bad event can delay many good events.
So instead of keeping the failed event in the main processing path, we need a controlled retry flow.
This keeps the main topic cleaner and gives us a separate place to handle failures.
Retry Topic + DLQ Pattern Idea
The idea is simple.
Instead of retrying the failed event immediately in the same consumer flow:
- send the failed event to a retry topic
- track how many times it has been retried
- try processing it again from the retry topic
- if it still fails after max attempts, send it to a DLQ
DLQ means Dead Letter Queue.
It is a separate topic where permanently failed events are stored for later inspection.
So the failed event is not lost.
It is moved to a place where we can monitor it, debug it, and decide what to do next.
High-Level Flow
In this demo, we use three Kafka topics:
- order-events
- order-events-retry
- order-events-dlq
The flow is:
- API publishes OrderPlacedEvent to order-events
- Main consumer tries to process the event
- If processing fails, event is sent to order-events-retry
- Retry consumer reads the event and checks retry count
- If retry count is below max attempts, event is sent back to retry topic
- If max attempts are reached, event is sent to order-events-dlq
- DLQ consumer logs the failed event
flowchart TD
A[Client calls POST /api/events/orders] --> B[Publish OrderPlacedEvent]
B --> C[order-events topic]
C --> D[Main Consumer]
D --> E{Processing Success?}
E -- Yes --> F[Save processed order in H2]
E -- No --> G[Send to order-events-retry]
G --> H[Retry Consumer]
H --> I{Max retries reached?}
I -- No --> G
I -- Yes --> J[Send to order-events-dlq]
J --> K[DLQ Consumer logs failed event]
Implementation
To understand this better, we have built a small Spring Boot application.
This application keeps the retry logic manual and explicit so the flow is easy to understand.
1) Kafka Topics
KafkaTopicConfig creates three topics:
@Bean
public NewTopic orderEventsTopic() {
return TopicBuilder.name(mainTopic)
.partitions(1)
.replicas(1)
.build();
}
@Bean
public NewTopic orderEventsRetryTopic() {
return TopicBuilder.name(retryTopic)
.partitions(1)
.replicas(1)
.build();
}
@Bean
public NewTopic orderEventsDlqTopic() {
return TopicBuilder.name(dlqTopic)
.partitions(1)
.replicas(1)
.build();
}
The topic names come from application.yml:
app:
topics:
main: order-events
retry: order-events-retry
dlq: order-events-dlq
This keeps the topic names configurable.
2) Publishing the Order Event
The API endpoint is:
POST /api/events/orders
OrderController accepts the request and calls the service:
@PostMapping("/orders")
@ResponseStatus(HttpStatus.ACCEPTED)
public EventPublishResponse publishOrderEvent(@Valid @RequestBody CreateOrderRequest request) {
return kafkaRetryDemoService.publish(request);
}
The request body contains:
{
"orderId": "order-1001",
"customerId": "customer-42",
"amount": 499.99
}
In the service, a new OrderPlacedEvent is created with a generated eventId:
public EventPublishResponse publish(CreateOrderRequest request) {
OrderPlacedEvent event = new OrderPlacedEvent(
UUID.randomUUID().toString(),
request.orderId(),
request.customerId(),
request.amount()
);
sendToMainTopic(event);
return new EventPublishResponse(
"Order event published to Kafka for asynchronous processing",
event.eventId(),
mainTopic
);
}
The event is first sent to the main topic:
private void sendToMainTopic(OrderPlacedEvent event) {
log.info("Publishing eventId={} to main topic={}", event.eventId(), mainTopic);
kafkaTemplate.send(buildMessage(mainTopic, event, 0));
}
3) Main Consumer
The main consumer listens to order-events:
@KafkaListener(topics = "${app.topics.main}", groupId = "order-events-main-consumer")
public void consumeMainTopic(OrderPlacedEvent event) {
log.info("Received event on main topic. eventId={}, orderId={}", event.eventId(), event.orderId());
try {
process(event);
} catch (RuntimeException ex) {
log.warn("Failure occurred on main topic for eventId={}. Sending to retry topic. reason={}",
event.eventId(), ex.getMessage());
sendToRetryTopic(event, 1);
}
}
The logic is simple:
- receive event from main topic
- try to process it
- if processing fails, send it to retry topic
- retry count starts from 1
4) Retry Count Header
To track retry attempts, this demo uses a Kafka header:
private static final String RETRY_COUNT_HEADER = "x-retry-count";
Whenever an event is sent to Kafka, the retry count is added as a header:
private Message<OrderPlacedEvent> buildMessage(String topic, OrderPlacedEvent event, int retryCount) {
return MessageBuilder.withPayload(event)
.setHeader(KafkaHeaders.TOPIC, topic)
.setHeader(KafkaHeaders.KEY, event.eventId())
.setHeader(RETRY_COUNT_HEADER, String.valueOf(retryCount))
.build();
}
This header helps the retry consumer decide whether to retry again or move the event to DLQ.
5) Retry Consumer
The retry consumer listens to order-events-retry.
@KafkaListener(topics = "${app.topics.retry}", groupId = "order-events-retry-consumer")
public void consumeRetryTopic(ConsumerRecord<String, OrderPlacedEvent> record) {
OrderPlacedEvent event = record.value();
int retryCount = readRetryCount(record);
log.info("Received event on retry topic. eventId={}, orderId={}, retryCount={}",
event.eventId(), event.orderId(), retryCount);
try {
process(event);
} catch (RuntimeException ex) {
log.warn("Failure occurred on retry topic for eventId={}. retryCount={}, reason={}",
event.eventId(), retryCount, ex.getMessage());
if (retryCount >= maxAttempts) {
log.warn("Max retries reached for eventId={}. Sending to DLQ.", event.eventId());
sendToDlq(event, retryCount);
return;
}
int nextRetryCount = retryCount + 1;
log.info("Retry attempt count increased for eventId={} to {}", event.eventId(), nextRetryCount);
sendToRetryTopic(event, nextRetryCount);
}
}
The retry consumer does three important things:
- reads the retry count from the header
- tries to process the event again
- sends the event to DLQ when max retries are reached
In this demo, max attempts are configured like this:
app:
retry:
max-attempts: 3
6) Reading the Retry Count
The retry count is read from the Kafka header:
private int readRetryCount(ConsumerRecord<String, OrderPlacedEvent> record) {
Header header = record.headers().lastHeader(RETRY_COUNT_HEADER);
if (header == null || header.value() == null || header.value().length == 0) {
return 0;
}
return Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
}
If the header is missing, the retry count is treated as 0.
This makes the retry flow easier to handle safely.
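Stripped of the Kafka classes, the header round-trip can be sketched as plain Java. The class and method names here are ours, not part of the demo; the encoding and the missing-header fallback mirror buildMessage and readRetryCount above:

```java
import java.nio.charset.StandardCharsets;

// Standalone sketch of how the x-retry-count header value travels as
// bytes and comes back as an int.
public class RetryCountHeaderDemo {

    // Encode the retry count the same way buildMessage does:
    // as the UTF-8 bytes of its decimal string.
    static byte[] encode(int retryCount) {
        return String.valueOf(retryCount).getBytes(StandardCharsets.UTF_8);
    }

    // Decode defensively: a missing or empty header is treated as
    // retry count 0, mirroring readRetryCount in the article.
    static int decode(byte[] headerValue) {
        if (headerValue == null || headerValue.length == 0) {
            return 0;
        }
        return Integer.parseInt(new String(headerValue, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(2))); // round-trip
        System.out.println(decode(null));      // missing header
    }
}
```

Keeping the header as a plain decimal string makes it easy to inspect with standard Kafka tooling.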
7) Sending to DLQ
Once max retries are reached, the event is sent to the DLQ topic:
private void sendToDlq(OrderPlacedEvent event, int retryCount) {
log.info("Sending eventId={} to DLQ topic={} after retryCount={}",
event.eventId(), dlqTopic, retryCount);
kafkaTemplate.send(buildMessage(dlqTopic, event, retryCount));
}
The DLQ consumer logs the failed event:
@KafkaListener(topics = "${app.topics.dlq}", groupId = "order-events-dlq-consumer")
public void consumeDlqTopic(OrderPlacedEvent event) {
log.error("Received event in DLQ. eventId={}, orderId={}", event.eventId(), event.orderId());
}
In real production systems, the DLQ is not just for logging.
Usually, teams monitor DLQ events and decide whether to:
- fix the bad data
- replay the event
- alert the owning team
- investigate a downstream failure
Processing Logic
To make the retry behavior easy to test, this project has a flag:
app:
processing:
fail-enabled: true
When this value is true, processing always fails:
void process(OrderPlacedEvent event) {
log.info("Processing eventId={}, orderId={}", event.eventId(), event.orderId());
if (failEnabled) {
throw new RuntimeException("Demo failure is enabled by app.processing.fail-enabled=true");
}
if (processedOrderRepository.existsByEventId(event.eventId())) {
log.info("Order already processed for eventId={}. Skipping duplicate success path.", event.eventId());
return;
}
ProcessedOrder processedOrder = new ProcessedOrder(
event.eventId(),
event.orderId(),
event.customerId(),
event.amount(),
LocalDateTime.now()
);
processedOrderRepository.save(processedOrder);
log.info("Successfully processed eventId={}, orderId={} and saved it to H2",
event.eventId(), event.orderId());
}
When fail-enabled=false, the event is processed successfully and saved into H2.
The ProcessedOrder entity stores:
- eventId
- orderId
- customerId
- amount
- processedAt
One important detail is that eventId is unique:
@Column(nullable = false, unique = true)
private String eventId;
This helps avoid duplicate successful processing for the same event.
End-to-End Testing (Local)
For this example, Kafka is expected to run locally on:
localhost:9092
The Spring Boot app runs on:
http://localhost:8082
Start the application:
mvn -pl retry-topic-dlq-pattern spring-boot:run
Now publish an order event:
curl -X POST http://localhost:8082/api/events/orders \
-H "Content-Type: application/json" \
-d '{
"orderId":"order-1001",
"customerId":"customer-42",
"amount":499.99
}'
Response:
{
"message": "Order event published to Kafka for asynchronous processing",
"eventId": "67c92559-f79a-42fc-a882-f172c1b9f935",
"topic": "order-events"
}
Scenario 1: Processing Succeeds
Set:
app:
processing:
fail-enabled: false
Expected flow:
- event is published to order-events
- main consumer receives it
- processing succeeds
- row is inserted into H2
In this case, retry topic and DLQ are not used.
After successful processing, we can see one row in the PROCESSED_ORDER table.
Scenario 2: Processing Fails and Goes to DLQ
Set:
app:
processing:
fail-enabled: true
Expected flow:
- main consumer receives the event
- processing fails
- event is sent to order-events-retry with x-retry-count=1
- retry consumer receives it
- retry count increases
- after max attempts, event is sent to order-events-dlq
- DLQ consumer logs the failed event
This clearly shows that the failed event is not lost.
It is moved to DLQ after controlled retry attempts.
Logs To Observe The Flow
The application logs the important steps clearly.
When processing fails and the event moves through retry and DLQ, we can see the full flow in the logs:
Received event on main topic. eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
Processing eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
Failure occurred on main topic for eventId=67c92559-f79a-42fc-a882-f172c1b9f935. Sending to retry topic. reason=Demo failure is enabled by app.processing.fail-enabled=true
Sending eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to retry topic=order-events-retry with retryCount=1
Received event on retry topic. eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001, retryCount=1
Processing eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
Failure occurred on retry topic for eventId=67c92559-f79a-42fc-a882-f172c1b9f935. retryCount=1, reason=Demo failure is enabled by app.processing.fail-enabled=true
Retry attempt count increased for eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to 2
Sending eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to retry topic=order-events-retry with retryCount=2
Received event on retry topic. eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001, retryCount=2
Processing eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
Failure occurred on retry topic for eventId=67c92559-f79a-42fc-a882-f172c1b9f935. retryCount=2, reason=Demo failure is enabled by app.processing.fail-enabled=true
Retry attempt count increased for eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to 3
Sending eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to retry topic=order-events-retry with retryCount=3
Received event on retry topic. eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001, retryCount=3
Processing eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
Failure occurred on retry topic for eventId=67c92559-f79a-42fc-a882-f172c1b9f935. retryCount=3, reason=Demo failure is enabled by app.processing.fail-enabled=true
Max retries reached for eventId=67c92559-f79a-42fc-a882-f172c1b9f935. Sending to DLQ.
Sending eventId=67c92559-f79a-42fc-a882-f172c1b9f935 to DLQ topic=order-events-dlq after retryCount=3
Received event in DLQ. eventId=67c92559-f79a-42fc-a882-f172c1b9f935, orderId=order-1001
These logs make it easy to verify:
- event first went to the main topic
- processing failed
- event moved to retry topic
- retry count increased
- event finally moved to DLQ
Why Not Retry Forever?
Retrying forever may sound simple, but it creates problems.
If the event has bad data, retrying will never fix it.
If the downstream system is down for a long time, endless retries can create unnecessary load.
If many events fail, retrying everything continuously can make the system unstable.
That is why we need:
- limited retry attempts
- clear retry tracking
- DLQ for failed events
- monitoring and alerting
What This Solves
Retry Topic + DLQ Pattern helps us handle failed events in a controlled way.
It gives us:
- retry for temporary failures
- protection from endless retry loops
- a separate place for permanently failed events
- better debugging and monitoring
This is very useful in event-driven systems where message processing can fail due to temporary or permanent issues.
What Is Still Required for Production
This demo keeps the implementation simple for learning.
In production systems, we usually need more things:
- retry delay or backoff
- separate retry topics for different delays
- DLQ monitoring and alerting
- replay mechanism from DLQ
- better error classification
- idempotent consumers
- tracing and correlation IDs
Retry With Backoff
In this demo, retry happens immediately.
This keeps the implementation simple and easy to understand.
But in production, immediate retry is not always useful.
If the database is down, retrying again within milliseconds will probably fail again.
So production systems usually use retry with backoff.
Example:
- first retry after 10 seconds
- second retry after 1 minute
- third retry after 5 minutes
This gives temporary failures some time to recover.
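Growing delays like these are usually computed with an exponential backoff formula. Here is a minimal standalone sketch (class name, method name, and the chosen multiplier are ours, not part of the demo) that also caps the delay so it never grows unbounded:

```java
public class BackoffDemo {

    // delay(attempt) = initialMillis * multiplier^(attempt - 1), capped at maxMillis.
    static long delayMillis(int attempt, long initialMillis, double multiplier, long maxMillis) {
        double raw = initialMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(raw, maxMillis);
    }

    public static void main(String[] args) {
        // With a 10s initial delay and multiplier 6, the attempts roughly follow
        // the 10s / 1m / 5m example above (the third attempt is capped at 5 minutes).
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + delayMillis(attempt, 10_000, 6.0, 300_000) + " ms");
        }
    }
}
```

In a real system, many teams also add random jitter to the delay so that a burst of failed events does not retry at exactly the same moment.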
Some systems use separate retry topics for each delay window.
For example:
- order-events-retry-10s
- order-events-retry-1m
- order-events-retry-5m
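Routing a failed event to the right delay tier can then be a small lookup. This helper is a hypothetical sketch (the class, method, and topic array are ours), assuming the retry count starts at 1 as in this demo:

```java
public class RetryTopicRouter {

    // Hypothetical delay-tier topics, matching the example names above.
    private static final String[] RETRY_TOPICS = {
            "order-events-retry-10s",
            "order-events-retry-1m",
            "order-events-retry-5m"
    };

    private static final String DLQ_TOPIC = "order-events-dlq";

    // retryCount starts at 1 for the first retry (as in the demo).
    // Each attempt maps to the next delay tier; after the last tier, DLQ.
    static String nextTopic(int retryCount) {
        if (retryCount >= 1 && retryCount <= RETRY_TOPICS.length) {
            return RETRY_TOPICS[retryCount - 1];
        }
        return DLQ_TOPIC;
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("attempt " + attempt + " -> " + nextTopic(attempt));
        }
    }
}
```

Each delay-tier topic then needs its own consumer that pauses for the configured window before reprocessing.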
Ordering Consideration
Retry topics also introduce one important tradeoff.
If strict ordering is required for events with the same key, retry design needs extra care.
Moving one failed event to a retry topic can allow later events to continue in the main topic.
That is good for availability.
But it may change the original processing order.
So if ordering is very important, the retry strategy should be designed based on the business requirement.
Spring Kafka Support
This demo implements the retry flow manually so the idea is easy to understand.
Spring Kafka also provides retry topic support, which can reduce some boilerplate in production applications.
But understanding the manual flow first makes the pattern much clearer.
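For reference, the annotation-driven approach with Spring Kafka's @RetryableTopic (available since Spring Kafka 2.7) might look roughly like this. This is an illustrative sketch, not part of the demo project, so check the Spring Kafka documentation for your version:

```java
// Spring Kafka creates the retry and dead-letter topics and tracks
// attempts itself, so the manual header bookkeeping is not needed.
@RetryableTopic(
        attempts = "3",
        backoff = @Backoff(delay = 10_000, multiplier = 6.0))
@KafkaListener(topics = "${app.topics.main}", groupId = "order-events-main-consumer")
public void consume(OrderPlacedEvent event) {
    process(event); // throwing here triggers the managed retry flow
}

// Called once all retry attempts are exhausted.
@DltHandler
public void handleDlt(OrderPlacedEvent event) {
    log.error("Event reached DLT. eventId={}", event.eventId());
}
```

The tradeoff is less visibility into the mechanics, which is exactly why this article walks through the manual version first.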
One important point:
Retry and DLQ do not replace idempotency.
If the same event is retried or replayed, the consumer should still be safe to process duplicates.
So Retry Topic + DLQ and Idempotent Consumer solve different problems.
Retry Topic + DLQ vs Idempotent Consumer
These two patterns are related, but they are not the same.
Retry Topic + DLQ
- handles failed processing
- retries temporary failures
- moves permanently failed events to DLQ
Idempotent Consumer
- handles duplicate event delivery
- prevents duplicate business side effects
- makes retry and replay safer
In real systems, we often need both.
Conclusion
Failures are normal in event-driven systems.
A Kafka consumer may fail because of database issues, downstream timeouts, bad data, or application bugs.
Retry Topic + DLQ Pattern gives us a practical way to handle these failures.
Simple summary:
- process event from main topic
- if it fails, send it to retry topic
- track retry count using a header
- retry only up to a configured limit
- after max retries, move the event to DLQ
- monitor and debug DLQ events later
This gives us a much cleaner and safer event-processing flow.
Repository for this demo: backend-patterns-and-practices/retry-topic-dlq-pattern.