Building Webhook Replay Systems That Actually Work

Webhooks fail. The receiving server is down, the network hiccups, the endpoint returns a 500. This is not exceptional — it is expected. Any webhook system that does not handle failed deliveries is a webhook system that will lose data.

Here is how to build webhook replay that works in production, based on patterns we use in WebhookVault.

The Retry Strategy

The temptation is to retry immediately. The server was down, maybe it will be up in a second. But immediate retries cause thundering herds: if a server goes down and 10,000 webhooks retry simultaneously, the server comes back up and is immediately overwhelmed by the retry storm.

Use exponential backoff with jitter:

import random

def next_retry_delay(attempt):
    base_delay = min(300, 2 ** attempt)  # Cap at 5 minutes
    jitter = random.uniform(0, base_delay * 0.1)
    return base_delay + jitter

# Attempt 1: ~2 seconds
# Attempt 2: ~4 seconds
# Attempt 3: ~8 seconds
# Attempt 4: ~16 seconds
# Attempt 5: ~32 seconds
# ...
# Attempt 9+: ~300 seconds (capped)

The jitter is critical. Without it, all failed webhooks retry at exactly the same intervals, creating synchronized bursts. With jitter, retries spread out naturally.

How Many Retries?

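Too few and you miss transient failures. Too many and you waste resources on permanently dead endpoints. Many production webhook systems use 5-8 retries spread over 24-48 hours. To cover that window, the delays have to grow well beyond the 5-minute cap in the example above. A typical schedule, with each delay measured from the previous failed attempt: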

  • Retry 1: 2 seconds (catches instant blips)
  • Retry 2: 30 seconds (catches restarts)
  • Retry 3: 5 minutes (catches short outages)
  • Retry 4: 30 minutes (catches deployments)
  • Retry 5: 2 hours (catches longer incidents)
  • Retry 6: 8 hours (catches overnight issues)
  • Retry 7: 24 hours (final attempt)

After the final retry, move the webhook to a dead letter queue for manual inspection. Never silently drop failed webhooks.
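
The schedule above can be written as a simple lookup table. This is a minimal sketch, assuming each delay is measured from the previous failed attempt; the names RETRY_DELAYS_SECONDS, delay_before_retry, and total_retry_window_hours are hypothetical and not part of WebhookVault.

```python
from typing import Optional

# Delay before each retry, in seconds, measured from the previous failed attempt.
# The values mirror the seven-step schedule listed in the article.
RETRY_DELAYS_SECONDS = [2, 30, 5 * 60, 30 * 60, 2 * 3600, 8 * 3600, 24 * 3600]


def delay_before_retry(retry_number: int) -> Optional[int]:
    """Return the delay before the given retry (1-based).

    Returns None once the schedule is exhausted, which is the signal to
    move the event to the dead letter queue instead of dropping it.
    """
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if retry_number > len(RETRY_DELAYS_SECONDS):
        return None
    return RETRY_DELAYS_SECONDS[retry_number - 1]


def total_retry_window_hours() -> float:
    """Total time between the first failure and the final retry."""
    return sum(RETRY_DELAYS_SECONDS) / 3600
```

With these values the full window is a little under 35 hours, inside the 24-48 hour range. In practice you would probably still add jitter to each delay so that events that failed together do not retry together.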

Idempotency Is Non-Negotiable

Because webhooks can be delivered more than once (the original delivery succeeded but the acknowledgment was lost, so the system retries), every webhook handler must be idempotent. Processing the same webhook twice must produce the same result as processing it once.

The standard approach: include a unique event ID in every webhook. The receiver stores processed event IDs and checks each incoming webhook against the store. If the ID has been seen, return 200 without reprocessing.

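def handle_webhook(request):
    event_id = request.headers.get('X-Webhook-ID')
    if not event_id:
        # Without an ID the event cannot be deduplicated, so reject it
        return Response(status=400)
    if is_already_processed(event_id):
        return Response(status=200)  # Already handled

    process_event(request.json())
    mark_as_processed(event_id)  # Record the ID only after processing succeeds
    return Response(status=200)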
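
One way to back the processed-ID store is a table with a unique key, so that the check and the write happen in a single statement. The sketch below uses Python's built-in sqlite3 module; the table name processed_events and the helpers open_store and claim_event are assumptions made for illustration, not part of any particular framework.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of processed event IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Record the event ID and report whether this call was the first to do so.

    INSERT OR IGNORE relies on the primary key to reject duplicates, so two
    concurrent deliveries of the same event cannot both get True.
    """
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
    return cursor.rowcount == 1
```

Note the trade-off: claiming the ID before processing closes the race between two concurrent deliveries, but if processing then crashes, a later retry will be skipped. Where the handler's side effects live in the same database, writing them in the same transaction as the claim avoids both problems.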

Dead Letter Queues

A dead letter queue (DLQ) catches webhooks that have exhausted all retries. It is the safety net that prevents data loss. A good DLQ provides:

  • Visibility: Dashboard showing failed webhooks with error details
  • Manual replay: One-click replay of individual events or bulk replay of a time range
  • Filtering: Filter by endpoint, error type, date range
  • Alerting: Notify when the DLQ exceeds a threshold (indicates a systemic problem, not a transient failure)
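
The filtering, replay, and alerting requirements above can be sketched as a small in-memory class. This is an illustration only: a real DLQ would sit on durable storage, and the names DeadLetter, DeadLetterQueue, on_alert, and deliver are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List, Optional


@dataclass
class DeadLetter:
    event_id: str
    endpoint: str
    error: str
    failed_at: datetime


class DeadLetterQueue:
    def __init__(self, alert_threshold: int, on_alert: Callable[[int], None]):
        self._items: List[DeadLetter] = []
        self._alert_threshold = alert_threshold
        self._on_alert = on_alert  # called with the queue size when over threshold

    def add(self, item: DeadLetter) -> None:
        self._items.append(item)
        if len(self._items) > self._alert_threshold:
            self._on_alert(len(self._items))

    def filter(
        self,
        endpoint: Optional[str] = None,
        error: Optional[str] = None,
        since: Optional[datetime] = None,
        until: Optional[datetime] = None,
    ) -> List[DeadLetter]:
        """Return matching items; a criterion left as None matches everything."""
        return [
            d for d in self._items
            if (endpoint is None or d.endpoint == endpoint)
            and (error is None or d.error == error)
            and (since is None or d.failed_at >= since)
            and (until is None or d.failed_at <= until)
        ]

    def replay(self, items: Iterable[DeadLetter],
               deliver: Callable[[DeadLetter], bool]) -> int:
        """Redeliver each item; remove the ones that succeed and count them."""
        succeeded = 0
        for item in list(items):
            if deliver(item):
                self._items.remove(item)
                succeeded += 1
        return succeeded
```

Bulk replay of a time range is then `dlq.replay(dlq.filter(since=start, until=end), deliver)`. Items that fail again stay in the queue.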

Circuit Breakers

If an endpoint has failed several times in a row, it is probably down for a while. Continuing to send webhooks wastes resources and may overwhelm the endpoint when it comes back. A circuit breaker pauses delivery after a configurable number of consecutive failures and resumes after a cooldown period.

States:
- CLOSED (normal): deliver webhooks normally
- OPEN (broken): skip delivery, queue webhooks for later
- HALF-OPEN (testing): send one webhook to test recovery

Transition: CLOSED -> OPEN after 5 consecutive failures
Transition: OPEN -> HALF-OPEN after 5 minutes
Transition: HALF-OPEN -> CLOSED if test succeeds
Transition: HALF-OPEN -> OPEN if test fails
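
The state machine above fits in a few dozen lines. The sketch below is one possible per-endpoint implementation, using the thresholds from the transitions (5 failures, 5-minute cooldown) as defaults. The class and method names are hypothetical, and the injectable clock parameter is an assumption added so the breaker can be tested without waiting.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self.state = "CLOSED"
        self._consecutive_failures = 0
        self._opened_at = 0.0

    def allow_delivery(self) -> bool:
        """Return True if a webhook may be sent to the endpoint right now."""
        cooled_down = self._clock() - self._opened_at >= self.cooldown_seconds
        if self.state == "OPEN" and cooled_down:
            # OPEN -> HALF-OPEN: let exactly one test delivery through.
            self.state = "HALF-OPEN"
            return True
        # While HALF-OPEN, further deliveries wait for the test result.
        return self.state == "CLOSED"

    def record_success(self) -> None:
        # HALF-OPEN -> CLOSED, or stay CLOSED; either way reset the counter.
        self.state = "CLOSED"
        self._consecutive_failures = 0

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        failed_test = self.state == "HALF-OPEN"
        if failed_test or self._consecutive_failures >= self.failure_threshold:
            # CLOSED -> OPEN or HALF-OPEN -> OPEN: restart the cooldown.
            self.state = "OPEN"
            self._opened_at = self._clock()
```

The delivery loop calls allow_delivery() before each send and queues the webhook when it returns False, then reports the outcome with record_success() or record_failure().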

What to Log

For every webhook delivery attempt, log: the event ID, the endpoint URL, the HTTP status code received, the response body (truncated), the response time, the attempt number, and the next retry time. When debugging delivery failures, you need the complete timeline of what happened and when.

Never log the webhook payload in plain text if it contains sensitive data. Log a hash or a reference ID that can be used to look up the payload in secure storage.
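
As a sketch of what one log record might look like, the function below collects the fields listed above and replaces the payload with a SHA-256 digest. The field names, the 512-character truncation limit, and the logger name are all assumptions chosen for illustration.

```python
import hashlib
import json
import logging
from typing import Optional

logger = logging.getLogger("webhook.delivery")  # hypothetical logger name


def build_attempt_record(
    event_id: str,
    endpoint_url: str,
    status_code: Optional[int],   # None when no HTTP response was received
    response_body: str,
    response_time_ms: float,
    attempt: int,
    next_retry_at: Optional[str],  # ISO 8601 timestamp, None after the final attempt
    payload: bytes,
    max_body_chars: int = 512,
) -> dict:
    """Build a log record for one delivery attempt without the raw payload."""
    return {
        "event_id": event_id,
        "endpoint_url": endpoint_url,
        "status_code": status_code,
        "response_body": response_body[:max_body_chars],  # truncated
        "response_time_ms": response_time_ms,
        "attempt": attempt,
        "next_retry_at": next_retry_at,
        # A digest identifies the payload without exposing its contents.
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }


def log_attempt(record: dict) -> None:
    """Emit the record as one JSON line so attempts are easy to search."""
    logger.info(json.dumps(record, sort_keys=True))
```

Keep in mind that a digest of a small or predictable payload can be guessed by brute force, so a reference ID into secure storage is the safer choice for highly sensitive data.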

Testing Webhook Replay

The hardest part of building a replay system is testing it, because you need to simulate failures reliably. WebhookVault helps here: create a test endpoint, send webhooks to it, inspect the payloads, and replay them to your actual handler. This lets you verify your idempotency logic, retry behavior, and error handling without waiting for real failures.
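
For unit tests, failures can also be simulated in-process with a test double that fails a fixed number of times before succeeding. The sketch below is a hypothetical stand-in, not WebhookVault's API: FlakyEndpoint and deliver_with_retries are illustrative names, and the retry loop omits delays so that tests run instantly.

```python
from typing import Callable, List


class FlakyEndpoint:
    """Test double: returns 500 for the first `failures` calls, then 200."""

    def __init__(self, failures: int):
        self.failures_left = failures
        self.received: List[str] = []  # event IDs that were accepted

    def __call__(self, event_id: str) -> int:
        if self.failures_left > 0:
            self.failures_left -= 1
            return 500
        self.received.append(event_id)
        return 200


def deliver_with_retries(
    send: Callable[[str], int], event_id: str, max_attempts: int
) -> int:
    """Return the attempt number that succeeded, or 0 if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        if send(event_id) == 200:
            return attempt
    return 0
```

An endpoint created with `FlakyEndpoint(2)` accepts the event on the third attempt, which lets a test assert both the attempt count and that the handler saw the event exactly once.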

Webhook replay is not a nice-to-have. It is the difference between a webhook system that works in demos and one that works in production. Build it before you need it, because by the time you need it, the data is already lost.
