Designing API Webhook Backoff: Retry Schedules, Maximum Attempts, and the Patterns Customers Trust

Webhook retry policy is one of the highest-leverage design decisions in a B2B API because it determines the contract between your delivery guarantees and your customer's recovery options. The patterns that scale are not the ones most APIs ship with.

Every API that emits webhooks eventually has to decide what happens when delivery fails. The choices are simple to state (how long to keep trying, how to space attempts, when to give up), but the consequences compound for customers, and the patterns that have emerged across mature webhook products are remarkably similar. This post documents what those patterns are, why they converged, and where the trade-offs sit.

Why retry policy is a customer-facing contract

From the API provider's perspective, retry policy looks like an implementation detail. From the customer's perspective, it is one of the load-bearing pieces of the integration contract, because it determines how much reliability they can build on their own side. If the API gives up after three retries spaced five minutes apart, the customer has fifteen minutes to become reachable. If the API retries for 24 hours with exponential backoff, the customer can take their receiver down for emergency maintenance without losing events. The same API serving the same workload looks dramatically different to integrators depending on the retry policy.

This is why mature webhook products (Stripe, GitHub, Shopify, Linear, Twilio) document their delivery-failure and retry behavior in detail. The schedule is a contract.

The basic retry schedule shape

Three patterns dominate.

Exponential backoff is the textbook answer: attempt at t=0, then wait 1m, 5m, 25m, 2h, 6h, 24h between successive attempts. Each interval is roughly a multiplicative factor of the previous one (5x for the first few steps in this example, tapering at the long end). The pattern fits the case where the receiver is temporarily down and will probably come back: most outages are short, and the widening spacing means retry traffic tails off quickly without abandoning the event entirely.

Linear backoff is the simpler variant: attempt at t=0, then wait 5m, 10m, 15m, 20m, 25m, with each interval growing by a fixed increment. Linear backoff is rarely the right answer because it spends more attempts at intermediate intervals where the receiver is unlikely to have recovered, and gives up sooner on the long-duration outages that exponential backoff handles better.

Fixed-interval is the wrong default: attempt every 5m for 24h. The retry-storm problem is severe (288 attempts), and so is the wasted-attempts problem: a receiver that came back after 6 hours is reached at 6h05, but only after about 72 useless attempts in between.
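The attempt counts above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `attempt_times` and `failed_attempts` are hypothetical helper names, the schedules are the example schedules from this section read as delays between attempts, and the outage is assumed to end exactly at the 6-hour mark.

```python
from itertools import accumulate

def attempt_times(delays_min):
    """Convert delays between attempts into absolute attempt times (minutes)."""
    return [0] + list(accumulate(delays_min))

def failed_attempts(times, outage_min):
    """Count attempts that land while the receiver is still down."""
    return sum(1 for t in times if t < outage_min)

# Fixed interval: one attempt every 5 minutes across a 24-hour window.
fixed = attempt_times([5] * 287)
# Exponential example from this section: 1m, 5m, 25m, 2h, 6h, 24h between attempts.
exponential = attempt_times([1, 5, 25, 120, 360, 1440])

OUTAGE_MIN = 6 * 60  # assumption: the receiver recovers at the 6-hour mark

fixed_total = len(fixed)
fixed_wasted = failed_attempts(fixed, OUTAGE_MIN)
exp_wasted = failed_attempts(exponential, OUTAGE_MIN)
# First attempt that lands after the receiver is back.
exp_first_success = next(t for t in exponential if t >= OUTAGE_MIN)
```

Under these assumptions the fixed schedule spends 72 attempts on the outage and the exponential schedule spends 5, at the price of delivering later (at the 511-minute mark rather than right at recovery).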

The right default for most webhook products is exponential backoff with full jitter, a cap of around 1 hour between attempts, and a total delivery budget of around 24-72 hours.
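As a rough illustration of that default, the sketch below generates the nominal (un-jittered) delays for a capped exponential schedule. The function name `backoff_delays` and the specific parameters (1-minute base, 5x factor) are assumptions chosen to match the examples in this post, not a prescribed implementation.

```python
def backoff_delays(base_s=60, factor=5, cap_s=3600, budget_s=24 * 3600):
    """Return the nominal delays (seconds) between attempts that fit in the budget."""
    delays = []
    elapsed = 0
    delay = base_s
    # Keep scheduling retries while the next one still lands inside the budget.
    while elapsed + delay <= budget_s:
        delays.append(delay)
        elapsed += delay
        # Grow multiplicatively, but never wait longer than the cap.
        delay = min(delay * factor, cap_s)
    return delays

delays_24h = backoff_delays()
delays_72h = backoff_delays(budget_s=72 * 3600)
```

With these parameters the schedule climbs 1m, 5m, 25m and then sits at the 1-hour cap until the budget runs out. A production sender would apply jitter to each delay and check the budget against actual elapsed time rather than the nominal sum.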

The maximum-attempts question

The reasoning splits across event types.

High-value events (payment confirmations, account state changes, security alerts) should have long retry windows. 72 hours is the right default for these. The cost of giving up is real customer harm; the cost of continuing to retry is small.

Operational events (job completion notifications, status updates, cron pings) should have shorter retry windows. 24 hours is the right default. The events become operationally stale after a day regardless of whether they ever delivered.

Real-time events (typing indicators, presence updates, live cursor position) should have very short retry windows or no retries at all. The information is stale within seconds, so a retry that succeeds five minutes later is delivering wrong information.

The mistake API designers make is picking one retry policy for all event types. The right design is per-event-type or at minimum per-subscription configurability.
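One way to express per-event-type policy with a per-subscription override is a small lookup table. This is a minimal sketch: the event-type names, the `RetryPolicy` class, and the choice to fall back to the operational policy for unknown types are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RetryPolicy:
    budget: timedelta  # total time to keep retrying; zero means no retries

    @property
    def retries_enabled(self):
        return self.budget > timedelta(0)

HIGH_VALUE = RetryPolicy(budget=timedelta(hours=72))
OPERATIONAL = RetryPolicy(budget=timedelta(hours=24))
REAL_TIME = RetryPolicy(budget=timedelta(0))

# Hypothetical event-type names, one or two per category described above.
POLICY_BY_EVENT_TYPE = {
    "payment.confirmed": HIGH_VALUE,
    "security.alert": HIGH_VALUE,
    "job.completed": OPERATIONAL,
    "presence.updated": REAL_TIME,
}

def policy_for(event_type, subscription_override=None):
    """Prefer a per-subscription override, then the per-event-type default."""
    if subscription_override is not None:
        return subscription_override
    # Assumption: unknown event types fall back to the operational policy.
    return POLICY_BY_EVENT_TYPE.get(event_type, OPERATIONAL)
```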

The jitter requirement

Without jitter, deliveries that failed at the same moment are retried at the same moments, both within one API and across every API the receiver integrates with. The recovery moment becomes a synchronized retry storm that can crash the receiver again. The fix is full jitter: instead of retrying at exactly t+5m, retry at a random time between t+0 and t+5m. The expected wait drops to half the nominal delay (double the nominal delay if the average spacing needs to be preserved), but the synchronization is broken.

The implementation is one line of code (replace fixed_delay with random_uniform(0, fixed_delay)) and it has saved many integrations from recovery-time outages. Most mature webhook products implement some form of it.
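In Python the one-liner might look like the sketch below, using the standard library's `random.uniform`. The function name `full_jitter` and the sampling loop are illustrative; the loop is only there to show that uniform sampling over [0, d] averages about d/2.

```python
import random

def full_jitter(delay_s, rng=random):
    """Pick a retry delay uniformly between 0 and the nominal delay."""
    return rng.uniform(0, delay_s)

# Seeded only so the demonstration is repeatable.
rng = random.Random(42)
samples = [full_jitter(300, rng) for _ in range(10_000)]
mean_delay = sum(samples) / len(samples)  # close to 150 seconds, not 300
```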

The dashboard requirement

The retry policy only earns customer trust if customers can see what is happening. The minimum dashboard surface is: per-subscription list of recent delivery attempts with status, response code, response time, and timestamp; per-event view showing all delivery attempts for a single event; manual replay button for failed or successful events; bulk replay for the case where the customer's receiver was down for hours.

The dashboard is the difference between webhook-API and webhook-product. Without it, customers cannot diagnose their own integration problems and every failure becomes a support ticket. With it, most support tickets become self-service.

The dead-letter pattern

When the retry budget is exhausted, the event should not silently vanish. The standard pattern is a dead-letter queue: events that failed to deliver get moved to a separate state where they remain visible in the dashboard for some retention period (commonly 30 days). The customer can manually trigger replay for any dead-lettered event during that window.

The trap is treating dead-letter as a problem the customer should fix. The right framing is that the dead-letter queue is the recovery path. An event that was retried for 24 hours and failed every attempt almost certainly indicates either a long-running receiver outage or a misconfigured endpoint, both of which the customer needs to act on. The dead-letter surface is what makes the action possible.
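The dead-letter lifecycle can be sketched as a small state transition plus a retention check. Everything named here (`State`, `Event`, `record_failure`, `can_replay`) is hypothetical, and the 30-day retention is simply the common value mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

class State(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    DEAD_LETTERED = "dead_lettered"

RETENTION = timedelta(days=30)  # assumption: the commonly used retention period

@dataclass
class Event:
    event_id: str
    created_at: datetime
    state: State = State.PENDING
    dead_lettered_at: Optional[datetime] = None

def record_failure(event, now, budget):
    """After a failed attempt, dead-letter the event once the budget is spent."""
    if event.state is State.PENDING and now - event.created_at >= budget:
        event.state = State.DEAD_LETTERED
        event.dead_lettered_at = now
    return event.state

def can_replay(event, now):
    """Dead-lettered events stay replayable until retention expires."""
    return (
        event.state is State.DEAD_LETTERED
        and event.dead_lettered_at is not None
        and now - event.dead_lettered_at <= RETENTION
    )
```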

The permanent-vs-transient error split

Not all delivery failures should retry. The right split is by status code:

4xx responses other than 408 and 429 should not retry. A 400 or 422 or 403 indicates the receiver actively rejected the event. Retrying will produce the same response and waste resources. The right action is to mark the delivery failed immediately and surface it in the dashboard.

408 (request timeout) and 429 (rate limited) should retry. These are transient signals that the receiver is asking for a retry. The retry should honor any Retry-After header the receiver provides.

5xx responses should retry. These indicate transient receiver problems that the standard exponential-backoff schedule is designed for.

Connection failures (DNS resolution failure, TCP connection failure, TLS failure) should retry. These are the canonical transient errors.

The mistake APIs make is retrying every non-2xx response. This wastes resources on 4xx responses that will never succeed and produces dashboard noise that hides the genuine transient failures.
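The status-code split above reduces to a short classifier. In this sketch, `should_retry` and `retry_after_seconds` are hypothetical names, a status of `None` stands for a connection-level failure, and treating 3xx responses as non-retryable is an assumption the text above does not address. Retry-After is parsed in both of its standard forms (delta-seconds and HTTP-date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def should_retry(status):
    """status is the HTTP status code, or None for a connection-level failure."""
    if status is None:
        return True  # DNS, TCP, or TLS failure
    if status in (408, 429):
        return True  # receiver is asking for a later retry
    if 500 <= status <= 599:
        return True  # transient receiver problem
    # 2xx succeeded; other 4xx were rejected; 3xx is not retried (assumption).
    return False

def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header; return None if absent or unparseable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A sender would typically wait for the larger of the Retry-After value and its own scheduled delay, so a receiver asking for more time is never retried early.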

Three operational signals to monitor

Delivery success rate (200-299 responses divided by total attempts) is the headline metric. Sustained drops below 95% indicate either a widespread receiver problem (most customer endpoints are down, which is rare) or a sender-side delivery problem (something is wrong on our end).

Per-subscription failure rate is the diagnostic metric. The sender-side failure mode is uniform across all subscriptions; the receiver-side failure mode is concentrated in a small number of subscriptions. The split tells you where to look.

Dead-letter rate (events that failed all retries) is the customer-impact metric. Sustained nonzero dead-letter rates indicate either persistent receiver problems (the customer's endpoint URL changed and we are still trying the old one) or events that exceed normal recovery windows.
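The three signals can be computed from a log of delivery attempts. The sketch below assumes a hypothetical record shape of `(subscription_id, status)` tuples, with `None` as the status for connection failures; the function names are illustrative.

```python
from collections import defaultdict

def is_success(status):
    return status is not None and 200 <= status <= 299

def success_rate(attempts):
    """Headline metric: 2xx responses divided by total attempts."""
    attempts = list(attempts)
    if not attempts:
        return None
    return sum(is_success(status) for _, status in attempts) / len(attempts)

def failure_rate_by_subscription(attempts):
    """Diagnostic metric: uniform rates suggest a sender-side problem,
    a few hot subscriptions suggest receiver-side problems."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for subscription_id, status in attempts:
        totals[subscription_id] += 1
        if not is_success(status):
            failures[subscription_id] += 1
    return {sub: failures[sub] / totals[sub] for sub in totals}

def dead_letter_rate(dead_lettered_events, total_events):
    """Customer-impact metric: share of events that exhausted all retries."""
    return dead_lettered_events / total_events if total_events else 0.0
```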

Our use across the four products

Our four products (DocuMint, CronPing, FlagBit, WebhookVault) share a single retry implementation: exponential backoff with full jitter, a 1-hour cap, a 24-hour total budget, and a maximum of 12 attempts. Stripe webhook handlers honor Stripe's own retry semantics rather than emitting our own (we are the receiver, not the sender). WebhookVault is the only product where webhook delivery is the primary product surface; the others emit operational webhooks as a secondary feature. The 24-hour budget is a deliberate trade-off favoring receiver-side reliability over rapid abandonment.

The deeper observation

Webhook retry policy is one of those decisions that looks like an implementation detail and is actually a customer contract. The patterns that have converged across mature webhook products are not arbitrary; they reflect the real trade-offs between delivery guarantees, receiver-side complexity, and dashboard usability. The deeper observation is that the API surface customers actually depend on includes the operational behaviors, not just the request-response interface, and the operational behaviors deserve the same level of design attention as the data model.

Our products, DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay), keep the lights on.
