
Designing API Bulk Webhook Replay: Patterns for Catastrophic Recovery

When a customer's webhook receiver has been broken for a day, the replay API needs to deliver thousands of events without overwhelming the receiver or burning through your delivery budget. Bulk replay is a separate primitive from per-event replay, and the patterns that work depend on the receiv

Anethoth

26 May 2026

Single-event replay is the easy case. A customer's webhook handler dropped a specific event; they click "replay" in the dashboard or call POST /webhook_events/:id/replay; the event re-delivers through normal retry machinery; everyone is happy. Bulk replay—where the customer needs to recover from hours or days of dropped events—is a fundamentally different operation, and pretending it is just "single replay, many times" produces APIs that catastrophically misbehave during exactly the incident they were designed for.

What customers actually need

A customer reaches for bulk replay in three situations, and each has different requirements.

The receiver was broken for hours. The customer's webhook endpoint returned 500s for the duration of a deployment problem or a downstream outage. The customer wants to replay all events from the affected window. The set of events is large (potentially thousands per hour for active integrations), the events are roughly contemporaneous, and the customer needs them in approximately the original order.

The receiver missed events for an extended period. The customer turned off the integration for a week while work was in progress; they now want to backfill the missed events. The set is even larger, the events span a wide time range, and order matters less because the customer's reconciliation logic is built around the final state.

The receiver has a bug in event handling. The customer fixed a code bug in their webhook handler that was silently dropping certain event types. They want to replay all events of those types from the past week. The set is filtered by event type, the time range is moderate, and order within type matters.

Why per-event replay does not compose

The obvious implementation of bulk replay is a loop in the customer's code that calls POST /webhook_events/:id/replay for each event ID they want to replay. This fails for three reasons.

First, the per-event endpoint is rate-limited per customer at a level appropriate for one-off recovery, typically a few requests per second. The customer needs to replay thousands of events, which would take hours at that rate, by which time many of the events are no longer useful.

Second, the per-event endpoint typically schedules an immediate delivery attempt. Thousands of immediate deliveries to a single customer endpoint produce a self-inflicted denial of service that overwhelms the just-recovered receiver and triggers another outage.

Third, the per-event endpoint has no concept of ordering or pacing. Replays arrive in whatever order the customer's loop processes them, which may bear no relation to the original order.

The bulk replay endpoint exists to solve all three problems at once: a single API call that schedules the replay of many events at a controlled rate, in defined order, against a rate-limited delivery channel that protects the receiver.

The shape of a bulk replay API

Minimum viable bulk replay endpoint:

POST /webhook_subscriptions/:id/replay

Body:

{
  "filter": {
    "from_event_at": "2026-05-24T00:00:00Z",
    "to_event_at": "2026-05-25T00:00:00Z",
    "event_types": ["invoice.paid", "invoice.failed"]
  },
  "delivery_rate_per_second": 5,
  "order": "chronological"
}

Response:

{
  "replay_job_id": "rj_abc123",
  "estimated_event_count": 4271,
  "estimated_duration_seconds": 855,
  "status_url": "/replay_jobs/rj_abc123"
}

The endpoint is bound to a specific subscription, not the customer's account, because different subscriptions go to different endpoints with different rate tolerances. The filter selects which events to replay; the delivery_rate_per_second paces the deliveries; the order parameter controls how the events are sequenced.

The response includes a job ID for tracking progress, plus an estimated event count so the customer knows what they are committing to. The estimated_duration_seconds gives the customer a realistic ETA, which is important because bulk replay of large windows takes hours.
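As a rough illustration of how the response fields could be derived, the sketch below applies the filter to a list of candidate events and estimates the duration as the event count divided by the delivery rate, rounded up. The `Event` type, the `plan_replay` helper, and the half-open time window are assumptions made for this example, not part of any published API.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical stored event; field names are illustrative."""
    id: str
    type: str
    created_at: datetime


def parse_ts(value: str) -> datetime:
    # Accept the trailing "Z" used in the request body.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def matches(event: Event, flt: dict) -> bool:
    """True if the event falls in the filter's time window and type list."""
    start = parse_ts(flt["from_event_at"])
    end = parse_ts(flt["to_event_at"])
    # Assumption: the window is half-open, [from_event_at, to_event_at).
    if not (start <= event.created_at < end):
        return False
    types = flt.get("event_types")
    return types is None or event.type in types


def plan_replay(events: list[Event], body: dict, job_id: str) -> dict:
    """Build the response for a bulk replay request without delivering anything."""
    rate = body["delivery_rate_per_second"]
    if rate <= 0:
        raise ValueError("delivery_rate_per_second must be positive")
    count = sum(1 for e in events if matches(e, body["filter"]))
    return {
        "replay_job_id": job_id,
        "estimated_event_count": count,
        # Best case: assumes the receiver never forces a pause or slowdown.
        "estimated_duration_seconds": math.ceil(count / rate),
        "status_url": f"/replay_jobs/{job_id}",
    }
```

With 4271 matching events at 5 per second, the estimate comes out to 855 seconds, which matches the sample response. The estimate is a lower bound, since any backoff triggered by the receiver extends the job.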

The pacing question

The delivery rate is the single most important parameter. Too high and the receiver gets overwhelmed by exactly the bulk replay it requested. Too low and the replay takes longer than the customer can wait.

The right default is conservative: 5-10 events per second is appropriate for most webhook receivers, which typically handle 50-200 events per second under normal load but cannot sustain 1000 per second through a backlog without queuing problems. Customers with high-throughput receivers can override; customers with low-throughput receivers can lower the rate.

The implementation pattern is a leaky-bucket scheduler that releases events at the configured rate, reading them from the database in the requested order (chronological, or grouped by event type). The scheduler should pause when the receiver returns 429 or 503, back off exponentially up to a cap, and then resume at the original rate.

The pacing should respond to the receiver's response codes. If the receiver consistently returns 200s, the rate can stay at the configured value. If the receiver starts returning 5xx errors, the rate should drop until the receiver recovers. The configured rate is a maximum, not an unconditional rate.
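The pacing rules above can be sketched as a small piece of state that turns the last response code into a wait time. This is a simplified fixed-interval pacer rather than a full leaky-bucket scheduler. The `ReplayPacer` name, the halving and doubling steps, the rate floor, and the backoff constants are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplayPacer:
    max_rate: float             # configured delivery_rate_per_second (a ceiling)
    min_rate: float = 0.5       # hypothetical floor, events per second
    base_backoff: float = 1.0   # hypothetical first pause, in seconds
    max_backoff: float = 60.0   # hypothetical backoff cap, in seconds
    rate: float = 0.0           # current rate; set to max_rate on creation
    consecutive_pauses: int = 0

    def __post_init__(self) -> None:
        self.rate = self.max_rate

    def delay_after(self, status: int) -> float:
        """Seconds to wait before the next delivery, given the last status code."""
        if status in (429, 503):
            # Pause with exponential backoff up to the cap; the rate is unchanged,
            # so deliveries resume at the previous rate once the pause ends.
            pause = min(self.base_backoff * 2 ** self.consecutive_pauses,
                        self.max_backoff)
            self.consecutive_pauses += 1
            return pause
        self.consecutive_pauses = 0
        if 500 <= status < 600:
            # Other 5xx responses: halve the rate, but never below the floor.
            self.rate = max(self.rate / 2, self.min_rate)
        elif 200 <= status < 300:
            # Success: recover toward the configured maximum, never above it.
            self.rate = min(self.rate * 2, self.max_rate)
        return 1.0 / self.rate
```

A delivery loop would call `delay_after` with each response and sleep for the returned duration. Doubling on success is one possible recovery policy; a gentler additive increase would be equally defensible.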

The ordering question

There are three reasonable answers, depending on the use case.

Chronological: events delivered in original creation order. This is the right default for receivers whose state is order-dependent (event-sourced systems, accounting integrations).

Reverse chronological: most recent first. Useful for receivers that want to recover quickly to a current-state-consistent view and can backfill history over time.

Unordered: parallel delivery up to the rate limit. Useful for receivers whose state is order-independent and who want maximum throughput. Most customers should not pick this.

The order parameter should be explicit in the API; a default of chronological is correct for most cases, with the dashboard and documentation prominently noting that order can be overridden.
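A minimal sketch of how the order parameter might map to a delivery sequence is shown below. Only "chronological" appears in the sample request; the "reverse_chronological" and "unordered" values, the `sequence_events` helper, and the dictionary event shape are assumptions for this example.

```python
from datetime import datetime, timezone


def sequence_events(events: list[dict], order: str = "chronological") -> list[dict]:
    """Return events in the order they should be handed to the scheduler."""
    # Tie-break on id so that events with equal timestamps sort deterministically.
    def key(event: dict):
        return (event["created_at"], event["id"])

    if order == "chronological":
        return sorted(events, key=key)
    if order == "reverse_chronological":
        return sorted(events, key=key, reverse=True)
    if order == "unordered":
        # No ordering guarantee: keep storage order and let workers run in parallel.
        return list(events)
    raise ValueError(f"unknown order: {order!r}")


if __name__ == "__main__":
    sample = [
        {"id": "evt_2", "created_at": datetime(2026, 5, 24, 10, tzinfo=timezone.utc)},
        {"id": "evt_1", "created_at": datetime(2026, 5, 24, 9, tzinfo=timezone.utc)},
    ]
    print([e["id"] for e in sequence_events(sample)])
```

Rejecting unknown values, rather than silently falling back to the default, keeps a typo in the order field from producing a replay in an order the customer did not intend.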

The status and cancellation question

Bulk replay jobs run for minutes to hours, and customers need both status and the ability to cancel.

The status endpoint should expose: current state (pending, running, paused, completed, cancelled), events delivered, events remaining, events failed, current delivery rate, last delivery attempt time, paused-because-of-receiver-errors flag.

Cancellation should be a POST /replay_jobs/:id/cancel that immediately stops scheduling new deliveries. In-flight deliveries complete; queued deliveries are discarded. Cancelled jobs cannot be resumed; the customer must start a new replay if they want the remaining events.

Pause and resume are operationally useful for customers who need to fix a downstream issue mid-replay. The endpoints are POST /replay_jobs/:id/pause and POST /replay_jobs/:id/resume. The state machine treats paused as a non-terminal state.
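The job lifecycle described in this section can be written down as an explicit transition table. The sketch below uses the five states listed for the status endpoint; the exact set of allowed transitions (for example, that a pending job can be cancelled before it starts) is an assumption, as are the `JobState` and `transition` names.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Allowed transitions. COMPLETED and CANCELLED are terminal: nothing leaves them,
# which is what makes a cancelled job impossible to resume.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.PAUSED, JobState.COMPLETED, JobState.CANCELLED},
    JobState.PAUSED: {JobState.RUNNING, JobState.CANCELLED},
    JobState.COMPLETED: set(),
    JobState.CANCELLED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    state = transition(JobState.RUNNING, JobState.PAUSED)   # pause endpoint
    state = transition(state, JobState.RUNNING)             # resume endpoint
    print(state.value)
```

Keeping the table in one place means the pause, resume, and cancel handlers all reject invalid requests the same way, for instance by mapping the `ValueError` to an HTTP 409 response.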

The dedup question

Customers replaying events expect their receivers to handle duplicates because the standard webhook contract requires idempotency on event ID. But a receiver that has been broken may have partially processed some events before failing, leading to mixed state where some events are partially-processed and others are not-processed. Bulk replay sends all events in the window, including the partially-processed ones, and the receiver has to sort it out.

The standard guidance is that the receiver implements an event_id-based idempotency check at the start of processing, and skipping already-processed events is the receiver's responsibility. The replay endpoint cannot reliably know what the receiver has already processed.

The replay endpoint should send the same delivery headers as the original delivery, plus an X-Webhook-Replay header that marks the delivery as a replay, so receivers that distinguish replays can apply different logic if desired (e.g., skip writing to audit logs).
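On the receiver side, the idempotency check and the replay header might be handled along the following lines. This is a sketch under several assumptions: the header value is taken to be the string "true", the processed-ID store is an in-memory set, and `handle_delivery` and its parameters are hypothetical names. A production receiver would keep the IDs in durable storage and record them in the same transaction as the side effects.

```python
from typing import Callable


def handle_delivery(
    headers: dict[str, str],
    event: dict,
    processed_ids: set[str],
    apply: Callable[[dict], None],
    audit_log: list[str],
) -> str:
    """Process one webhook delivery at most once per event ID."""
    event_id = event["id"]
    # Idempotency check first: a replay of an already-processed event is a no-op.
    if event_id in processed_ids:
        return "duplicate"

    apply(event)                  # the receiver's real business logic
    processed_ids.add(event_id)   # mark only after the work succeeded

    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    is_replay = normalized.get("x-webhook-replay", "").lower() == "true"
    if not is_replay:
        # Example of replay-specific logic: only first-time deliveries are audited.
        audit_log.append(event_id)
    return "processed"
```

Marking the ID only after `apply` succeeds means that an event that failed partway through is retried on replay, so `apply` itself must tolerate being run again over partially applied state.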

The cost and quota question

Bulk replay can deliver thousands of events. Charging customers per delivery or per webhook is fine in normal operation but becomes punitive when they are recovering from an outage that was your fault.

The right policy depends on the cause of the original event drop. If the customer's receiver was broken (4xx, 5xx, timeout from receiver side), replays should consume normal delivery quota. If the customer's events were dropped due to your platform outage, replays should not consume quota.

The implementation typically distinguishes these by the original delivery status. Original deliveries that returned customer-side errors count toward quota when replayed; original deliveries that failed due to platform-side problems do not. This requires the delivery records to capture enough information about the original failure to make the determination.
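One way to express this rule is to classify the original delivery outcome and count only customer-side failures against quota. The outcome categories and function names below are hypothetical; a real delivery record would need whatever fields let the platform tell the two cases apart.

```python
from enum import Enum


class OriginalOutcome(Enum):
    """Hypothetical classification of why the original delivery failed."""
    RECEIVER_4XX = "receiver_4xx"
    RECEIVER_5XX = "receiver_5xx"
    RECEIVER_TIMEOUT = "receiver_timeout"
    PLATFORM_ERROR = "platform_error"   # the event was dropped on the platform side


CUSTOMER_SIDE = {
    OriginalOutcome.RECEIVER_4XX,
    OriginalOutcome.RECEIVER_5XX,
    OriginalOutcome.RECEIVER_TIMEOUT,
}


def counts_toward_quota(outcome: OriginalOutcome) -> bool:
    """Replays of customer-side failures consume quota; platform failures do not."""
    return outcome in CUSTOMER_SIDE


def billable_replay_count(outcomes: list[OriginalOutcome]) -> int:
    """Number of replayed deliveries in a job that should be charged to quota."""
    return sum(1 for outcome in outcomes if counts_toward_quota(outcome))
```

Treating anything not positively identified as a customer-side failure as free errs in the customer's favor when the delivery record is ambiguous.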

The dashboard surface

Bulk replay is one of those features that customers reach for during incidents. The dashboard surface needs to be obvious and well-documented.

The dashboard should include: a "bulk replay" button on the webhook subscriptions page, a wizard interface that walks through filter selection and rate configuration, a confirmation screen showing the event count and estimated duration, and a status page showing in-progress jobs with their progress and the option to cancel or adjust pacing.

The dashboard should expose the same API behind the scenes. Customers who use the dashboard should be able to script the same operations against the API. The dashboard should not have features that are not in the API.

Three patterns that fail

First, exposing only per-event replay and expecting customers to loop. This fails for the reasons described above and produces customer-side incidents when they try to use the API for recovery.

Second, allowing unlimited delivery rate with the assumption that customers will configure correctly. Most customers do not, and the result is bulk replays that overwhelm receivers. Cap the maximum rate at a defensible value (typically 50-100/second) and require customers to request limit increases for higher rates.

Third, treating every bulk replay as a billing event with the same per-call cost as original delivery, regardless of what caused the original failure. This makes recovery from outages punitively expensive and encourages customers to skip replay or to implement worse manual alternatives. The cost model should encourage the use of the right primitive.
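The rate cap from the second pattern can be enforced when the request is validated. In the sketch below, the default of 5 per second comes from the conservative range suggested earlier and the cap of 50 per second from the low end of the suggested maximum; the constant names, the `resolve_rate` helper, and the per-customer `approved_max` override are assumptions for illustration.

```python
from typing import Optional

DEFAULT_RATE = 5.0         # conservative default, events per second
PLATFORM_MAX_RATE = 50.0   # hypothetical platform-wide cap, events per second


def resolve_rate(requested: Optional[float],
                 approved_max: float = PLATFORM_MAX_RATE) -> float:
    """Validate the requested delivery rate against the customer's approved cap."""
    if requested is None:
        # No rate supplied: fall back to the conservative default.
        return min(DEFAULT_RATE, approved_max)
    if requested <= 0:
        raise ValueError("delivery_rate_per_second must be positive")
    if requested > approved_max:
        # Reject instead of silently clamping, so the customer learns about the cap.
        raise ValueError(
            f"delivery_rate_per_second exceeds the approved maximum of {approved_max}; "
            "request a limit increase for higher rates"
        )
    return float(requested)
```

Rejecting an over-cap request is one design choice; clamping to the cap and reporting the effective rate in the response would also work, as long as the customer can see which rate is actually in force.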

Our use across the four products

WebhookVault, as the primary webhook product, has the most complete bulk replay implementation. The endpoint supports filtering by event type, time range, and delivery status, with configurable pacing and chronological ordering as default. The dashboard exposes the same operations through a wizard interface designed for incident recovery.

CronPing, FlagBit, and DocuMint emit fewer webhooks per customer and have simpler bulk replay surfaces, with per-subscription replay of a time range as the basic primitive. None of them support the full filter grammar that WebhookVault does because the customer base for those products has not pushed for it.

The deeper observation is that bulk replay is the rare feature where the design is dictated by the worst case, not the average case. The single-event replay is used dozens of times per customer; the bulk replay is used a few times per year during incidents. But during those incidents, the design quality of the bulk replay determines whether the recovery is smooth or another incident in itself. The asymmetry argues for spending design effort on the rare case, which is exactly what most APIs do not do because the day-to-day metrics do not reflect incident-time quality.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.