Designing Webhook Handlers That Survive Real Production

The standard webhook tutorial shows you a handler with three lines: parse the payload, do the work, return 200. Real production webhook handlers look nothing like this. They are surrounded by guards, queues, retries, signature checks, idempotency stores, and dead-letter pipelines. Most of those layers are not optional.

This is what actually breaks.

The provider retries faster than your handler can finish

Stripe gives your handler 30 seconds. GitHub gives it 10. If you do not respond in time, the delivery counts as failed, and a provider that retries automatically (Stripe does; GitHub leaves redelivery to you) sends the event again. If your handler is in the middle of a slow database write or an outbound API call, you will see the same event twice, sometimes three or four times, within a minute.

The fix is not to make your handler faster. The fix is to decouple: receive the payload, write it to durable storage, return 200, and process asynchronously. The webhook endpoint becomes a thin shim that does almost nothing synchronously. Everything else is the queue's problem.

A typical pattern looks like:

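import json
from fastapi import FastAPI, Request, Response  # assumes FastAPI; other ASGI frameworks look similar

app = FastAPI()

# verify_signature, already_processed and enqueue_for_processing are application helpers defined elsewhere.
@app.post("/webhooks/stripe")
async def receive(request: Request) -> Response:
    body = await request.body()  # raw bytes, exactly as the provider signed them
    sig = request.headers.get("stripe-signature", "")
    if not verify_signature(body, sig):  # assumed to return False on a mismatch
        return Response(status_code=400)
    event_id = json.loads(body)["id"]
    if already_processed(event_id):
        return Response(status_code=200)  # duplicate: acknowledge and skip
    enqueue_for_processing(body)  # must be durably stored before we answer
    return Response(status_code=200)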

Synchronous work: signature verification (cheap), idempotency check (cheap), enqueue (cheap). With a low-latency queue this typically adds up to tens of milliseconds, far inside any provider's deadline. The provider is happy. The actual business logic happens in a worker that can take its time.
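The "write it to durable storage" step can be as small as one committed insert. The sketch below uses SQLite from the standard library as a stand-in for whatever queue or database you actually run; the table name, the open_inbox helper, and the extra conn parameter on enqueue_for_processing are assumptions made for illustration, not part of any provider's SDK.

```python
import sqlite3


def open_inbox(path: str = "webhook_inbox.db") -> sqlite3.Connection:
    """Open (and create if needed) the table that acts as the durable inbox."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body BLOB NOT NULL,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def enqueue_for_processing(conn: sqlite3.Connection, body: bytes) -> int:
    """Persist the raw payload and return its row id.

    The transaction is committed before this function returns, so the
    caller only answers 200 once the payload has been stored.
    """
    with conn:  # commits on success, rolls back if the insert raises
        cur = conn.execute("INSERT INTO inbox (body) VALUES (?)", (body,))
    return cur.lastrowid
```

Storing the raw bytes rather than parsed JSON keeps the original payload available for later signature re-checks and replays.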

The provider sends you the same event multiple times even when nothing failed

Most providers document "at-least-once delivery" in fine print and bury the implications. The implication is: your handler will see duplicates. Sometimes minutes apart. Sometimes weeks apart, when the provider replays from a backup.

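The fix is idempotency keyed on the provider's event ID. Store every processed event ID in a table with a TTL of at least 30 days. On every incoming event, check before processing, and make the check and the insert a single atomic operation (a unique constraint on the event ID works) so that two concurrent deliveries of the same event cannot both pass. This is not optional. Skipping it is a common cause of "we charged the customer twice" support tickets in webhook-based systems.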

The signature verification is subtler than the docs imply

Every provider has a signing scheme. Stripe uses HMAC-SHA256 over the timestamp and the raw payload joined by a period, hex-encoded. GitHub uses HMAC-SHA256 over the raw body, hex-encoded and sent with a sha256= prefix. Shopify uses HMAC-SHA256 over the raw body, base64-encoded. They differ in the details and the details matter.

Three failure modes that bite people repeatedly:

You verify against the parsed JSON, not the raw bytes. JSON parsers can normalize whitespace, reorder keys, or change number formatting. The signature is over the raw bytes. Always.
You forget the timestamp window. Stripe's signature includes a timestamp. If you do not check that the timestamp is recent, an attacker who captured one valid webhook can replay it forever. The window should be small — five minutes is standard.
You compare with == instead of hmac.compare_digest. A regular string comparison is timing-attack-vulnerable. Always use a constant-time comparison for cryptographic verification.

The worker dies mid-event and loses work

Once you have decoupled receive from process, you have a new failure mode: the worker picks up an event, starts processing, and crashes before finishing. If the queue removed the event before the worker finished, the event is lost.

The fix is "ack on completion, not on receive." The queue should hide the event from other workers while one worker processes it, and delete it only when that worker explicitly acknowledges completion. If the worker dies, its claim lapses and another worker picks the event up. The mechanism differs by system: SQS uses a visibility timeout; Redis Streams with consumer groups keeps unacknowledged entries in a pending list that another consumer can claim; PostgreSQL with SELECT FOR UPDATE SKIP LOCKED releases the row lock when the crashed worker's transaction rolls back.

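This works because of the idempotency layer above, with one condition: the worker must record an event as processed only after its side effects are done, or in the same transaction as them. A crash then leaves the event unmarked and the redelivery runs it again, so each step must tolerate being repeated. Use upserts rather than blind inserts, and pass idempotency keys on outbound API calls.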

The provider's retries cause cascading load

You have a 30-second outage. The provider tries to send you 100 webhooks during that window and gets timeouts on all of them. Each one goes into the provider's retry queue. When you come back up, the provider releases all 100 retries — plus any new events that arrived in the meantime — in a burst.

The fix is to plan for these bursts. Your queue should have substantially more headroom than your steady-state rate. Your worker pool should be able to scale up to handle bursts. And ideally your worker should not process events any faster than your downstream services can handle, so that a burst does not cascade.

The events arrive out of order

You will see order.fulfilled before order.created. This is not a bug in the provider. It is a consequence of the provider's distributed architecture. Your handler should tolerate it.

The usual approach: each event includes the latest snapshot of the relevant entity, not just a delta. Apply a snapshot only if it is newer than what you already hold, judged by a version number or updated-at timestamp on the entity, and your local state converges even when events arrive out of order. If the provider only sends deltas, you need to fetch the current state via the API after each event, which is more expensive but more reliable.
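Snapshot-based convergence depends on never letting an older snapshot overwrite a newer one. A minimal sketch, assuming each snapshot carries a monotonically increasing version (an updated-at timestamp would serve the same role); the apply_snapshot function and the dict-based store are stand-ins for your own persistence layer.

```python
def apply_snapshot(store: dict, entity_id: str, version: int, snapshot: dict) -> bool:
    """Write the snapshot only if it is newer than the stored one.

    Returns True if the store was updated, False if the event was stale.
    """
    current = store.get(entity_id)
    if current is not None and current["version"] >= version:
        return False  # late arrival of an older event: ignore it
    store[entity_id] = {"version": version, "data": snapshot}
    return True


# Usage: the "fulfilled" event arrives before the "created" event.
orders: dict = {}
apply_snapshot(orders, "order_1", 3, {"status": "fulfilled"})
apply_snapshot(orders, "order_1", 1, {"status": "created"})  # ignored as stale
```

In a real database the comparison and the write should happen in one statement or transaction (for example an UPDATE guarded by a version condition) so that concurrent workers cannot interleave.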

Dead-letter handling

Some events will fail to process repeatedly — malformed data, unrecognized event types, references to deleted entities. After some retry budget is exhausted, the event should go to a dead-letter queue and a human should be notified. Silently dropping failed events is the worst possible outcome: you have lost data and you do not know it.

The dead-letter queue should have its own monitoring. If it is filling up, something is wrong upstream and you want to know about it before the customer does.
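The retry budget and dead-letter routing can be sketched as a wrapper around the event handler. MAX_ATTEMPTS, process_with_budget, the list used as a dead-letter queue, and the notify callback are all hypothetical names; a real system would use the queue's own redrive or dead-letter feature where one exists.

```python
from typing import Callable

MAX_ATTEMPTS = 5  # illustrative retry budget


def process_with_budget(
    event: dict,
    handler: Callable[[dict], None],
    attempts: int,
    dead_letters: list,
    notify: Callable[[str], None],
) -> str:
    """Run the handler once and decide what happens to the event next.

    `attempts` is the number of failed attempts so far. Returns "done",
    "retry", or "dead".
    """
    try:
        handler(event)
        return "done"
    except Exception as exc:  # broad on purpose: any failure counts against the budget
        if attempts + 1 >= MAX_ATTEMPTS:
            # Keep the event and the error so a human can inspect and replay it.
            dead_letters.append(
                {"event": event, "error": repr(exc), "attempts": attempts + 1}
            )
            notify(f"event {event.get('id')} dead-lettered: {exc!r}")
            return "dead"
        return "retry"
```

The important property is that no branch drops the event: it is either finished, put back for another attempt, or parked somewhere a person will see it.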

Tools that help

WebhookVault captures every incoming webhook so you can replay them after fixes, inspect what providers actually sent, and confirm signatures match. CronPing monitors that your worker is processing events on schedule, alerting you when it stops. FlagBit lets you turn off processing for a specific event type without redeploying.

The principle

The naive webhook handler is a function. The production webhook handler is a system: receiver, queue, worker, idempotency store, dead-letter queue, monitoring. Each piece is small. Together they make webhook handling reliable.

The hardest part is not building any one piece. It is recognizing that you need them all before you find out the painful way.