Designing Webhook Replay APIs: Patterns for Idempotent Re-Delivery


The shape of a webhook integration that has just gone wrong is familiar: the receiver was down for 20 minutes, the sender's retry budget expired, hundreds of events are missing, and the customer is on the phone asking how to recover the state. The answer most webhook APIs offer is some variant of "please log in to the dashboard and refetch the missing data via the REST API." That answer is technically correct and operationally exhausting. The answer that customers actually want is a replay button.

A working webhook replay is one of the highest-trust features a webhook API can offer. It is also one of the easiest to design wrong in ways that produce silent data duplication or hide failures behind the appearance of success. We thought through the shape carefully for WebhookVault and applied the same pattern to the webhook surfaces of CronPing, FlagBit, and DocuMint. The shape that ages well is replay-by-event-ID with mandatory consumer-side idempotency, not replay-by-time-range with optimistic assumptions about consumer behavior.

Replay by event ID, not by time range

The first design decision is what unit of replay the API exposes. The intuitive choice is a time range: "replay everything from 14:00 to 14:30." The intuitive choice is wrong, and the reason is that time ranges hide volume.

A customer who clicks "replay last 30 minutes" might trigger 5 events or 500,000. The same UI affordance produces wildly different blast radii depending on traffic. Worse, the customer often does not know which time range to pick: if the receiver was intermittently flaky between 13:50 and 14:15, the customer guesses a window that includes both events that did not deliver and events that did. The replay re-sends the events that were already delivered, which is harmless only if the consumer is idempotent. If the consumer is not idempotent, the time-range replay creates the same kind of damage as the original failure.

The right unit is the individual event. The API exposes a list of failed-or-undelivered events from the recent window, and the customer (or the customer's own automation) selects which events to replay. Each event has a stable ID, the replay request takes a list of event IDs, and the sender re-fires those specific events to the destination. The customer sees exactly what is being replayed and can choose to replay everything, replay nothing, or replay a subset.
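As a rough illustration of replay-by-event-ID, the sketch below builds a replay request from an explicit selection of failed events. The FailedEvent type, the build_replay_request helper, and the "event_ids" request field are hypothetical names chosen for this example, not part of any particular product's API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FailedEvent:
    """One failed-or-undelivered event as listed by the API (hypothetical shape)."""
    event_id: str
    event_type: str


def build_replay_request(
    candidates: Iterable[FailedEvent], selected_ids: Iterable[str]
) -> Dict[str, List[str]]:
    """Return a replay request body containing only explicitly selected events.

    Duplicate selections are collapsed (first occurrence wins) and any ID that
    is not in the candidate list is rejected, so the caller always sees exactly
    what will be re-fired.
    """
    known = {event.event_id for event in candidates}
    unique_ids = list(dict.fromkeys(selected_ids))  # de-duplicate, keep order
    unknown = [event_id for event_id in unique_ids if event_id not in known]
    if unknown:
        raise ValueError(f"not in the list of replayable events: {unknown}")
    return {"event_ids": unique_ids}
```

Because the request carries an explicit list, its size is visible before anything is sent, which is the property a time-range request lacks.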

The minimum viable replay schema

Three tables, none of them complicated. A webhook_events table with stable event IDs and the full event payload, retained for a configurable window (we default to 30 days). A webhook_subscriptions table mapping event-type filters to destination URLs and signing secrets. A webhook_deliveries table recording every delivery attempt with status, response code, and timestamp, keyed by (event ID, subscription ID).
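One possible rendering of those three tables, using SQLite through Python's standard sqlite3 module. The column names and types are assumptions made for illustration; only the table names and the broad contents come from the description above. Because every attempt gets its own row, (event ID, subscription ID) is treated here as a non-unique index rather than a primary key.

```python
import sqlite3

# Illustrative schema only: column names and types are assumptions.
SCHEMA = """
CREATE TABLE webhook_events (
    event_id    TEXT PRIMARY KEY,   -- stable ID, reused on replay
    event_type  TEXT NOT NULL,
    payload     TEXT NOT NULL,      -- full event payload, e.g. JSON
    created_at  TEXT NOT NULL       -- ISO 8601 timestamp
);
CREATE TABLE webhook_subscriptions (
    subscription_id   TEXT PRIMARY KEY,
    event_type_filter TEXT NOT NULL,
    destination_url   TEXT NOT NULL,
    signing_secret    TEXT NOT NULL
);
CREATE TABLE webhook_deliveries (
    delivery_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id        TEXT NOT NULL REFERENCES webhook_events(event_id),
    subscription_id TEXT NOT NULL
        REFERENCES webhook_subscriptions(subscription_id),
    status          TEXT NOT NULL
        CHECK (status IN ('pending', 'successful', 'failed')),
    response_code   INTEGER,        -- NULL when the attempt timed out
    attempted_at    TEXT NOT NULL
);
-- One row per attempt, so the (event, subscription) pair is not unique.
CREATE INDEX idx_deliveries_event_subscription
    ON webhook_deliveries (event_id, subscription_id);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables on an empty database."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
```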

A delivery is "successful" if the destination returned a 2xx within the configured timeout. A delivery is "failed" if all retry attempts produced non-2xx responses or timeouts. A delivery is "pending" if it is still being retried. The replay API surfaces failed and pending deliveries and lets the customer trigger a fresh delivery attempt for any of them, which writes a new row to webhook_deliveries with the new attempt's outcome.
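The three statuses can be written down as a small function. This is a sketch under stated assumptions: the attempt history is represented as a list of response codes with None standing for a timeout, and max_attempts is a hypothetical name for the sender's retry budget.

```python
from typing import Optional, Sequence


def classify_delivery(
    response_codes: Sequence[Optional[int]], max_attempts: int
) -> str:
    """Classify a delivery from its attempt history.

    response_codes holds one entry per attempt; None means the attempt
    timed out. max_attempts is the assumed retry budget.
    """
    if any(code is not None and 200 <= code < 300 for code in response_codes):
        return "successful"  # some attempt got a 2xx within the timeout
    if len(response_codes) >= max_attempts:
        return "failed"      # budget spent, every attempt non-2xx or timeout
    return "pending"         # still being retried
```

For example, classify_delivery([503, None], max_attempts=5) is "pending", while the same history with max_attempts=2 is "failed".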

The retention window for events is the maximum replay horizon. Customers who need a longer horizon should be told to fetch the missing data via the REST API, because storing every webhook payload indefinitely is rarely worth the cost. 30 days is enough for almost every recovery scenario; the rare longer-term recoveries are recoveries from a fundamentally broken consumer that should be rebuilt from scratch anyway.
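The replay horizon reduces to a single comparison. In this sketch the 30-day default mirrors the figure above, and is_replayable is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=30)  # default window described above


def is_replayable(
    created_at: datetime,
    now: datetime,
    retention: timedelta = DEFAULT_RETENTION,
) -> bool:
    """True while the event is still inside the retention window."""
    return now - created_at <= retention
```

An event that fails this check has to be recovered through the REST API instead, since its payload is no longer stored.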

The consumer idempotency contract

The replay API only works if the consumer is idempotent. The replay button is a tool that re-fires events; if the consumer cannot tolerate seeing the same event twice, replaying creates duplicates. The webhook API's job is to make this contract explicit, document it loudly, and verify that consumers behave correctly.

The minimum contract is that the consumer keys all side effects on the event ID. The receiver maintains a processed_events table (or equivalent) and inserts the event ID before performing the side effect, using a unique constraint to detect duplicates. If the insert fails, the event has already been processed and the consumer returns 2xx without re-doing the work. This is the same pattern as idempotency keys for ordinary API requests, and the webhook documentation should describe it explicitly with code examples for the major frameworks.
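A minimal sketch of that contract on the consumer side, using SQLite so that it runs as-is. The handle_event and init_dedup_table names are hypothetical, and the example assumes the side effect either lives in the same database or can be safely retried if it fails partway.

```python
import sqlite3
from typing import Callable


def init_dedup_table(conn: sqlite3.Connection) -> None:
    """Create the table whose primary key detects duplicate deliveries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )


def handle_event(
    conn: sqlite3.Connection, event_id: str, side_effect: Callable[[], None]
) -> bool:
    """Run side_effect at most once per event ID.

    Returns True if the work was done now and False if the event had
    already been processed. Either way the HTTP handler should answer 2xx.
    """
    with conn:  # commits on success, rolls back if side_effect raises
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # primary key violation: seen before, skip the work
        side_effect()
    return True
```

Because the insert and the side effect share one transaction, a side effect that raises rolls the insert back, and a later retry or replay of the same event is processed normally rather than being dropped as a duplicate.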

The webhook payload includes the event ID in a stable, easy-to-extract location (usually a top-level field). Replays use the same event ID as the original delivery, so the consumer's deduplication works without any awareness that the delivery is a replay. The HTTP request can optionally include an X-Webhook-Replay header that indicates the delivery is a replay, which is useful for logging and metrics but should not change the consumer's behavior — the deduplication is what makes it safe.
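On the receiving side, extracting both pieces of information might look like the following. The top-level "id" field name and the "true" header value are assumptions for illustration; the text above only fixes the header name X-Webhook-Replay.

```python
import json
from typing import Mapping, Tuple


def parse_delivery(headers: Mapping[str, str], body: str) -> Tuple[str, bool]:
    """Return (event_id, is_replay) for an incoming webhook request.

    Assumes the event ID is a top-level "id" field and that replays carry
    "X-Webhook-Replay: true". Header names are matched case-insensitively.
    """
    event = json.loads(body)
    event_id = event["id"]
    normalized = {name.lower(): value for name, value in headers.items()}
    is_replay = normalized.get("x-webhook-replay", "").strip().lower() == "true"
    return event_id, is_replay
```

The is_replay flag is meant for logging and metrics only. Deduplication should key on event_id alone, so that a replay and a late retry are handled identically.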

What the dashboard should show

Three views in the dashboard that close the loop. First, a list of failed deliveries with their event IDs, destination URLs, last attempt timestamp, and response code, sorted by recency. The customer can filter by destination, by time range, or by status. Each row has a "replay" button that triggers a single delivery attempt.

Second, a per-event view that shows the full payload, the full delivery history (every attempt, every response), and a "replay" button. This is the view a customer ends up on when they are investigating a specific failure: they have an event ID from somewhere (their support ticket, their own logs) and they want to see what happened.

Third, a bulk replay tool that lets the customer select a set of events and replay them in batches. This is the tool customers reach for during incident recovery when a destination was down for an extended period. It should rate-limit the replay (so a one-click recovery does not overwhelm the receiver), and it should track progress so the customer can see the recovery happening.
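A rate-limited bulk replay with progress reporting can be sketched in a few lines. The batch size, the pause length, and the replay_one and on_progress callbacks are all hypothetical parameters; a production sender would more likely use a queue and a token bucket, but the shape is the same.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


def replay_in_batches(
    event_ids: Iterable[str],
    replay_one: Callable[[str], bool],
    batch_size: int = 50,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Dict[str, List[str]]:
    """Replay events in fixed-size batches, pausing between batches.

    replay_one returns True when the destination answered 2xx.
    on_progress receives (events_done, events_total) after each batch.
    """
    ids = list(event_ids)
    succeeded: List[str] = []
    failed: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        for event_id in batch:
            (succeeded if replay_one(event_id) else failed).append(event_id)
        done = start + len(batch)
        if on_progress is not None:
            on_progress(done, len(ids))
        if done < len(ids):
            sleep(pause_seconds)  # throttle so the receiver is not flooded
    return {"succeeded": succeeded, "failed": failed}
```

Passing sleep as a parameter keeps the function testable without real delays.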

The API for programmatic replay

The dashboard is a UI on top of an API, and the API should be first-class. The shape is straightforward: GET /webhook_deliveries?status=failed&since=... to list candidates, POST /webhook_deliveries/{id}/replay to trigger one replay, and POST /webhook_deliveries:batch_replay for many. The batch endpoint returns 202 Accepted with a job ID that can be polled for progress.
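The three endpoints can be expressed as request builders with the standard library. The base URL and the "delivery_ids" body field are placeholders assumed for this sketch; only the paths and methods come from the description above. The functions build urllib Request objects without sending them.

```python
import json
from typing import Iterable
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host


def list_failed_request(since: str) -> Request:
    """GET /webhook_deliveries?status=failed&since=..."""
    query = urlencode({"status": "failed", "since": since})
    return Request(f"{BASE_URL}/webhook_deliveries?{query}", method="GET")


def replay_one_request(delivery_id: str) -> Request:
    """POST /webhook_deliveries/{id}/replay"""
    path_id = quote(delivery_id, safe="")
    return Request(f"{BASE_URL}/webhook_deliveries/{path_id}/replay", method="POST")


def batch_replay_request(delivery_ids: Iterable[str]) -> Request:
    """POST /webhook_deliveries:batch_replay (body shape is an assumption)."""
    body = json.dumps({"delivery_ids": list(delivery_ids)}).encode("utf-8")
    return Request(
        f"{BASE_URL}/webhook_deliveries:batch_replay",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each request would be sent with urllib.request.urlopen plus whatever authentication the API requires. For the batch call, the caller would read the job ID from the 202 response and poll it for progress.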

The programmatic API enables customer-side automation for the recovery patterns that recur. A customer whose primary database failover triggers a 30-minute receiver outage can write a recovery script that lists failed deliveries during the outage window and replays them. Neither the provider's dashboard nor its support team is in the loop for the recovery; the customer's tooling is. That is the right level of abstraction for a webhook product that respects customer engineering.
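A customer-side recovery script along those lines might look like this. The list_failed and batch_replay callables stand in for whatever API client the customer uses, and the "id" and "last_attempt_at" field names are assumptions made for the example.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional


def recover_outage(
    list_failed: Callable[[datetime], List[Dict[str, Any]]],
    batch_replay: Callable[[List[str]], str],
    outage_start: datetime,
    outage_end: datetime,
) -> Optional[str]:
    """Replay every delivery that failed during a known outage window.

    list_failed(since) returns failed deliveries as dicts with "id" and
    "last_attempt_at" keys; batch_replay(ids) returns a job ID to poll.
    Returns None when there is nothing to replay.
    """
    failed = list_failed(outage_start)
    in_window = [
        delivery["id"]
        for delivery in failed
        if outage_start <= delivery["last_attempt_at"] <= outage_end
    ]
    if not in_window:
        return None
    return batch_replay(in_window)
```

The time window here only narrows the list of candidates. The replay request itself still carries explicit IDs, so the script can log exactly what it is about to re-fire.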

The operational signals

Three signals to monitor on a replay API. First, the replay request rate, which tells you when customers are recovering. A spike in replays usually correlates with a recent destination outage; if it correlates with nothing visible, that is a signal something is wrong with your delivery pipeline that customers are masking by replaying. Second, the replay success rate, which should be much higher than initial delivery success rate (because the receiver has had time to recover); if it is not, customers are replaying events to destinations that are still broken, which produces wasted load. Third, the time-since-event-creation distribution of replays, which tells you whether your retention window is enough — if customers are routinely replaying events near the retention limit, the window is too short.
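The second and third signals are simple aggregates over replay records. In this sketch the ReplayRecord type and the 90%-of-retention threshold for "near the limit" are assumptions; the right threshold depends on the product.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReplayRecord:
    """One replay attempt (hypothetical shape)."""
    event_age: timedelta  # time between event creation and the replay
    succeeded: bool       # destination returned 2xx


def replay_success_rate(records: Sequence[ReplayRecord]) -> Optional[float]:
    """Fraction of replays that succeeded, or None with no data."""
    if not records:
        return None
    return sum(r.succeeded for r in records) / len(records)


def share_near_retention_limit(
    records: Sequence[ReplayRecord],
    retention: timedelta = timedelta(days=30),
    threshold: float = 0.9,
) -> Optional[float]:
    """Fraction of replays whose event age is at least threshold * retention."""
    if not records:
        return None
    cutoff = retention * threshold
    return sum(r.event_age >= cutoff for r in records) / len(records)
```

A replay success rate close to the initial delivery success rate suggests customers are replaying into destinations that are still broken, and a large share near the retention limit suggests the window is too short.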

The deeper observation

A webhook replay button is the API equivalent of an "undo" button: it is a feature that exists specifically to make recovery from a category of failures painless. The features that exist for recovery are the features that customers remember as "I was glad they had this when I needed it." The discipline that makes a replay button safe is the same discipline that makes the rest of the webhook API trustworthy: stable event IDs, explicit consumer idempotency, and clear semantics about what each operation does. The replay button is not the hard part; the hard part is building a webhook API where replay is a natural extension of how it already works.
