Why Your Webhook Retry Schedule Matters More Than Your Retry Count

Pattern: Webhook retry scheduling
Risk: Delivery failure during receiver downtime even with retries enabled
Signal: Retry window, not retry count, determines whether maintenance windows are survivable

The most common webhook retry configuration looks something like this: retry up to 5 times, with a 1-minute delay between attempts. It feels thorough. It has retries. The count is visible in the dashboard. The problem is that this configuration fails completely whenever the receiver is down for more than 5 minutes — which is exactly when retries matter most.

A receiver going down for a rolling deployment, a database restart, or a maintenance window typically stays down for 10 to 30 minutes. Five retries over five minutes expend all attempts during the outage and deliver nothing. The webhook is dead. The customer discovers missing events in their system hours later.
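
The arithmetic is easy to check. The sketch below is a minimal illustration, not any provider's implementation: the function names are made up for this example, and it assumes the first delivery attempt fails at the moment the outage begins.

```python
from datetime import timedelta

def retry_window(delays):
    """Total time from the first failed attempt to the last retry."""
    return sum(delays, timedelta())

def survives_outage(delays, outage):
    """True if at least one retry fires once the receiver is back up.

    Assumption: the first attempt fails exactly when the outage starts.
    """
    elapsed = timedelta()
    for delay in delays:
        elapsed += delay
        if elapsed >= outage:
            return True
    return False

# The "5 retries, 1 minute apart" configuration against a 20-minute outage.
five_by_one = [timedelta(minutes=1)] * 5
print(retry_window(five_by_one))                             # 0:05:00
print(survives_outage(five_by_one, timedelta(minutes=20)))   # False
```

Raising the count without widening the delays barely helps: twenty retries at one minute each is still only a 20-minute window.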

The major webhook providers solved this years ago by making the retry window the primary variable, not the count.

Stripe retries over 3 days, with attempts spaced across hours and then days
GitHub retries over 2 days
Twilio retries over 24 hours

None of these are 5-retries-over-5-minutes. They all converge on hours-to-days windows. The count is secondary; it's the window that determines whether routine receiver downtime is survivable.

The three retry schedules and their tradeoffs:

Linear: 5 min, 10 min, 15 min, 20 min. Simple to reason about. Recovers from short outages. Falls apart for any outage longer than the total window, which here is under an hour.

Exponential: 5 min, 25 min, 2 hr, 10 hr, 48 hr, with each interval roughly five times the last. Survives multi-hour outages. The long tail catches events that would otherwise be lost. The downside is that the widening gaps look odd in logs, and an event that fails its early attempts during a brief blip may wait hours for the next one.

Exponential with jitter: Same as exponential, but each interval has ±30% random variation. The jitter prevents synchronized retry storms when many subscriptions fail simultaneously after a receiver-side outage. If 10,000 subscriptions all failed at 14:00 and all retry at precisely 14:05, you've built a thundering herd. Jitter spreads the load. This is the correct default for any production webhook system with more than a few hundred subscribers.
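
A minimal sketch of the jittered schedule, assuming the exponential intervals listed above and the ±30% variation. The names `BASE_SCHEDULE` and `jittered_delay` are hypothetical, and a uniform distribution is one reasonable choice among several.

```python
import random
from datetime import timedelta

# Base intervals taken from the exponential schedule in the text.
BASE_SCHEDULE = [
    timedelta(minutes=5),
    timedelta(minutes=25),
    timedelta(hours=2),
    timedelta(hours=10),
    timedelta(hours=48),
]

def jittered_delay(attempt, schedule=BASE_SCHEDULE, jitter=0.30, rng=random):
    """Delay before retry number `attempt` (0-based), varied by +/- `jitter`.

    Attempts past the end of the schedule reuse the last interval.
    """
    base = schedule[min(attempt, len(schedule) - 1)]
    factor = rng.uniform(1.0 - jitter, 1.0 + jitter)
    return base * factor

# Each subscription draws its own delay, so 10,000 failures at 14:00
# retry somewhere between roughly 14:03:30 and 14:06:30 instead of all at 14:05.
for attempt in range(len(BASE_SCHEDULE)):
    print(attempt, jittered_delay(attempt))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.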

The Retry-After header deserves first-class status. If a receiver returns 429 or 503 with a Retry-After header, the webhook provider should respect it. A receiver that says "wait 120 seconds" knows its recovery time better than the sender's exponential formula does. Ignoring Retry-After and retrying on schedule is the sender imposing its schedule on a receiver that has explicitly communicated its own.
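
In HTTP, a Retry-After value is either a number of seconds or an HTTP date, so a sender has to handle both. The sketch below shows one way to do that with the Python standard library. The one-hour cap and the function names are assumptions made for this example, not part of any provider's API.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed safety cap so a bad header cannot park an event indefinitely.
MAX_RETRY_AFTER = timedelta(hours=1)

def retry_after_delay(header_value, now=None):
    """Return the delay a Retry-After header asks for, or None if unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        # delay-seconds form, e.g. "120"
        seconds = min(int(value), int(MAX_RETRY_AFTER.total_seconds()))
        return timedelta(seconds=seconds)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    delay = max(when - now, timedelta())
    return min(delay, MAX_RETRY_AFTER)

def next_delay(scheduled, retry_after_header):
    """Use the receiver's requested delay when present, else our own schedule."""
    requested = retry_after_delay(retry_after_header)
    return scheduled if requested is None else requested
```

For example, `next_delay(timedelta(minutes=25), "120")` yields a two-minute wait rather than the scheduled 25 minutes, and a missing or unparseable header falls back to the schedule.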

What not to retry: 4xx errors other than 429. A 400 means the payload is malformed, and retrying delivers the same malformed payload again. A 401 or 403 means authentication or authorization failed, and the credentials haven't changed between retries. A 404 means the endpoint doesn't exist. These are permanent failures until someone changes the payload, the credentials, or the endpoint; retrying them is noise. A 5xx error means something is wrong on the receiver side and may be transient, so retry these.
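
That rule fits in a few lines. This is a sketch of the policy as stated here, with two assumptions of my own labeled in the comments: a missing status (timeout or connection error) is treated as retryable, and 1xx or 3xx responses are treated as permanent failures. Some senders also retry 408; this sketch does not.

```python
from enum import Enum

class Action(Enum):
    DONE = "done"     # delivered, stop
    RETRY = "retry"   # transient, schedule another attempt
    FAIL = "fail"     # permanent, skip retries and park the event

def classify(status):
    """Map an HTTP status code (or None for no response) to an action."""
    if status is None:
        # Assumption: timeouts and connection errors are transient.
        return Action.RETRY
    if 200 <= status < 300:
        return Action.DONE
    if status == 429 or 500 <= status < 600:
        return Action.RETRY
    # Remaining 4xx are permanent. Assumption: 1xx and 3xx are also
    # treated as failures rather than followed or retried.
    return Action.FAIL

for code in (200, 400, 403, 404, 429, 503, None):
    print(code, classify(code).value)
```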

The dead-letter queue is the actual safety net, not the retry schedule. Retries handle transient downtime. A DLQ handles events that have exhausted all retry attempts. Without a DLQ, events that fail after all retries are simply gone. With a DLQ, they're recoverable after the customer fixes whatever broke. Build the DLQ first; tune the retry schedule second.
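
The shape of a DLQ is simple enough to sketch in memory. The example below is illustrative only: `Event`, `Dispatcher`, and the `send` callable are hypothetical names, and a production system would persist dead letters in durable storage rather than a Python deque.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str
    payload: dict
    attempts: int = 0

class Dispatcher:
    def __init__(self, send, max_attempts=6):
        self.send = send                  # callable(Event) -> bool (True = delivered)
        self.max_attempts = max_attempts  # assumed: first attempt plus five retries
        self.dead_letters = deque()

    def attempt(self, event):
        """Try one delivery. Returns 'delivered', 'retry', or 'dead'."""
        event.attempts += 1
        if self.send(event):
            return "delivered"
        if event.attempts >= self.max_attempts:
            # Retries exhausted: park the event instead of dropping it.
            self.dead_letters.append(event)
            return "dead"
        return "retry"

    def replay_dead_letters(self):
        """Re-send parked events; anything that still fails stays parked."""
        delivered = 0
        for _ in range(len(self.dead_letters)):
            event = self.dead_letters.popleft()
            if self.send(event):
                delivered += 1
            else:
                self.dead_letters.append(event)
        return delivered
```

Once the customer has fixed their endpoint, a call to `replay_dead_letters()` recovers the events that would otherwise have been lost.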

Most webhook implementations get these priorities backwards. They spend engineering cycles on retry count configuration while leaving the window at minutes rather than hours. The retry window is what determines whether your webhook delivery is reliable across normal operational conditions. The count is an implementation detail.
