Webhook Retries Done Right: Backoff, Jitter, and Giving Up

The first time you ship a webhook integration in production, you discover that the network is hostile, your consumer is hostile, and your own retry logic is somewhere on a spectrum between useless and self-inflicted denial of service. The good news is the working patterns are small in number; the bad news is most teams reinvent them badly.

Retry only on signals that mean "try again"

The first decision in any retry policy is the trigger. The wrong default is to retry on every non-2xx response. The right default is to retry on a small whitelist: connection refused, connection reset, read timeout, DNS failure, and HTTP 408, 429, 502, 503, and 504. Any other 4xx (a 400, a 401, a 422) means the consumer has rejected the request deliberately, and re-sending it will produce the same result every time. Retrying on those responses is how you turn a bad customer integration into an unintentional flood.
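As a rough sketch, the whitelist can live in one small predicate. The function name should_retry and the mapping from failure types to Python exception classes are assumptions made for illustration; an HTTP client library will usually wrap these errors in its own exception types.

```python
import socket

# Status codes treated as transient, taken from the whitelist above.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})

# Transport-level failures: refused or reset connections, timeouts, DNS errors.
# socket.gaierror is what the standard library raises for failed name lookups.
RETRYABLE_ERRORS = (
    ConnectionRefusedError,
    ConnectionResetError,
    TimeoutError,
    socket.timeout,
    socket.gaierror,
)


def should_retry(status=None, error=None):
    """Return True only for signals that mean "try again"."""
    if error is not None:
        return isinstance(error, RETRYABLE_ERRORS)
    return status in RETRYABLE_STATUS
```

Everything not on the list, including any exception type the predicate has never heard of, is treated as permanent. That keeps the failure mode conservative: an unknown error stops retrying instead of looping.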

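HTTP 429 is the interesting case. It means "you're going too fast" and often carries a Retry-After header, given either as a number of seconds or as an HTTP date. Honour it when it is present: wait at least that long before the next attempt, and fall back to your normal backoff when it is missing. Treat the response as a hint, not a punishment.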
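A minimal sketch of reading that header, assuming the raw header value is already available as a string. The helper name retry_after_seconds is hypothetical; the parsing relies only on the standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value into seconds, or None if it is unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

A reasonable way to combine the two signals is to sleep for the larger of this value and your own backoff delay, so a consumer can slow you down but never speed you up.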

Exponential backoff, but capped, with jitter

Pure exponential backoff — 1s, 2s, 4s, 8s, 16s, 32s — is the textbook answer. It is also incomplete. Two failures in production reveal what is missing.

The first is the thundering herd: when an entire downstream goes down for 30 seconds and recovers, every webhook in flight retries on the same exponential schedule, hitting the recovered service at exactly the same wall-clock moments. The fix is jitter. Pick a random delay between zero and the current backoff value. AWS calls this "full jitter" and it dramatically smooths the load curve on recovery.

The second is the unbounded backoff. Without a cap, a schedule that starts at one second doubles past four hours by the fifteenth retry, and your worker process ends up sleeping for hours between attempts on a webhook that was queued long ago. The right ceiling for most webhook systems is one hour. After that, the producer has to make a different decision than just trying harder.
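Both fixes fit in a few lines. This is a sketch under assumed constants: a one-second base to match the 1s, 2s, 4s schedule, and the one-hour ceiling suggested above. The function name backoff_delay is illustrative.

```python
import random

BASE_DELAY = 1.0     # seconds, first step of the 1s, 2s, 4s... schedule
MAX_DELAY = 3600.0   # one-hour ceiling


def backoff_delay(attempt, rng=random):
    """Delay in seconds before retry number `attempt` (1-based), with full jitter."""
    # Clamp the exponent so very old deliveries do not compute huge powers.
    exponent = min(max(attempt - 1, 0), 32)
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** exponent)
    # Full jitter: anywhere between zero and the capped exponential value.
    return rng.uniform(0.0, ceiling)
```

Passing a seeded random.Random instance as rng makes the schedule reproducible in tests, while production code can use the module-level default.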

Total budget, not retry count

"Retry up to 5 times" is the common spec, but it conflates two different policies. Some events are valuable enough to retry for 24 hours; others lose meaning after 60 seconds. The right shape is a budget: a maximum delivery window and a maximum number of attempts within it, whichever comes first.

For most webhooks, 24 hours and 12 attempts is a sane default. Providers differ widely: Stripe retries live-mode deliveries for up to three days, while GitHub does not retry failed deliveries automatically and leaves redelivery to the consumer. The number you pick should be visible to the consumer in your documentation, because their idempotency window has to be wider than your retry window or they will end up processing duplicates after a quiet period.
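One way to express the budget is a small value object checked before every attempt. The class name DeliveryBudget and its fields are assumptions for illustration; timestamps are plain seconds, such as the values returned by time.time().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryBudget:
    max_age_seconds: float = 24 * 3600   # delivery window
    max_attempts: int = 12               # attempt cap within the window

    def exhausted(self, first_attempt_at, attempts_made, now):
        """True once either limit is reached, whichever comes first."""
        too_old = now - first_attempt_at >= self.max_age_seconds
        too_many = attempts_made >= self.max_attempts
        return too_old or too_many
```

An event that loses meaning quickly can carry its own budget, for example DeliveryBudget(max_age_seconds=60, max_attempts=5), while everything else uses the defaults.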

Persist the queue, don't keep it in memory

The most common production failure is a process restart eating an in-flight retry queue. If your webhook delivery worker keeps state in a Python list or a Node Map, every deploy loses every pending retry. The right floor is a persistent table — SQLite is more than enough for tens of thousands of pending deliveries — with columns for next attempt time, attempt count, last status, and last error. Workers claim rows atomically and update them in place.
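A minimal sketch of such a table using Python's built-in sqlite3 module. The table name, the claimed_by column, and the helper functions are all hypothetical. The claim takes SQLite's write lock with BEGIN IMMEDIATE before reading, so two workers cannot pick the same row.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS deliveries (
    id INTEGER PRIMARY KEY,
    event_id TEXT NOT NULL,
    url TEXT NOT NULL,
    payload TEXT NOT NULL,
    next_attempt_at REAL NOT NULL,
    attempt_count INTEGER NOT NULL DEFAULT 0,
    last_status INTEGER,
    last_error TEXT,
    claimed_by TEXT
)
"""


def open_queue(path):
    # isolation_level=None means autocommit, so transactions are explicit below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def claim_next(conn, worker_id, now):
    """Atomically claim one due delivery; return (id, url, payload) or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, url, payload FROM deliveries "
            "WHERE claimed_by IS NULL AND next_attempt_at <= ? "
            "ORDER BY next_attempt_at LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE deliveries SET claimed_by = ? WHERE id = ?",
                (worker_id, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def record_failure(conn, delivery_id, status, error, next_attempt_at):
    """Update the row in place and release the claim for the next attempt."""
    conn.execute(
        "UPDATE deliveries SET attempt_count = attempt_count + 1, "
        "last_status = ?, last_error = ?, next_attempt_at = ?, claimed_by = NULL "
        "WHERE id = ?",
        (status, error, next_attempt_at, delivery_id),
    )
```

A real worker would also need to release claims held by crashed workers, for instance with a claim timestamp and a timeout. That part is left out of the sketch.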

This also makes observability trivial. "How many retries are pending right now?" becomes a SELECT statement instead of a metric you forgot to instrument.

Tell consumers what idempotency they need

If you retry, your consumer will eventually receive the same event twice. This is a feature, not a bug, but only if the consumer knows. Send a stable event ID in a header (we use X-Event-Id) on every attempt of the same delivery. Document explicitly that consumers must dedupe by this ID. Not by payload hash, not by timestamp — by your event ID, because that's the one thing guaranteed to be identical across retries.
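On the consumer side, deduplication by event ID might look like the sketch below. The EventDeduper class and handle_webhook function are hypothetical, headers are assumed to arrive as a plain dict with the exact key X-Event-Id, and the in-memory dict stands in for whatever durable store a real consumer would use.

```python
import time


class EventDeduper:
    """Remembers event IDs for a window that should exceed the producer's retry window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._seen = {}  # event_id -> time it was processed

    def seen(self, event_id, now=None):
        now = time.time() if now is None else now
        # Forget entries older than the window before checking.
        cutoff = now - self.window_seconds
        self._seen = {k: t for k, t in self._seen.items() if t >= cutoff}
        return event_id in self._seen

    def remember(self, event_id, now=None):
        self._seen[event_id] = time.time() if now is None else now


def handle_webhook(headers, body, deduper, process):
    """Return the HTTP status the consumer should send back."""
    event_id = headers.get("X-Event-Id")
    if event_id is None:
        return 400
    if deduper.seen(event_id):
        return 200  # duplicate attempt: acknowledge without reprocessing
    process(body)
    # Remember only after processing succeeds, so a crash leads to a retry.
    deduper.remember(event_id)
    return 200
```

Acknowledging a duplicate with a 200 matters: any other response would look like a failure to the producer and trigger yet another retry.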

Dead-letter, with a human in the loop

After your retry budget runs out, the delivery has to go somewhere. Silently dropping it is the worst outcome — the producer assumed it succeeded, the consumer never saw it, and the inconsistency surfaces a week later as a customer support ticket. The right move is a dead-letter table, an alert, and an admin UI that lets a human inspect the request and either retry it manually or abandon it deliberately.

This is the small, boring, end-of-funnel work that distinguishes a webhook system that "mostly works" from one that operators trust. WebhookVault exposes captured retries with the same shape as our retry table for exactly this reason: when something has been retrying for an hour, you want to see what's actually going over the wire, not what your retry counter claims.

The pattern in one paragraph

Retry on a small whitelist of failures. Use exponential backoff with full jitter, capped at one hour. Express your policy as a delivery budget — a window plus an attempt cap — not a raw retry count. Persist pending retries on disk. Send a stable event ID for idempotency. End with a dead-letter table that a human can act on. Six small choices, and your webhook delivery suddenly behaves like infrastructure rather than a science project.