Designing API Webhook Receivers That Survive Replay Storms
When a webhook provider replays its backlog of events to recover from a multi-day outage, your receiver suddenly takes a hundred times its normal load. The patterns that handle steady-state webhooks gracefully usually buckle under replay storms. The fixes are not the obvious ones.
The pattern repeats every few months in someone's incident channel: a webhook provider had an outage, queued events for two days, and replayed all of them in a tight window once they recovered. The receiver, which normally handles 50 webhooks a minute, suddenly takes 5000 a minute for an hour. Things break.
The standard receiver-side guidance, ack-fast and queue, handles the steady state well. Replay storms are different. They concentrate days of work into an hour. The patterns that survive are not the patterns that worked yesterday.
Ack-fast is necessary but not sufficient
The canonical webhook receiver pattern: validate the signature, insert the event into a processed_events table for idempotency, enqueue the work for async processing, return 2xx. This handles steady-state traffic fine because the queue grows by one event and is drained by one event in the same window.
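The idempotency step can be sketched with an insert that silently ignores duplicates. This is a minimal illustration, not the article's implementation: it assumes SQLite as the store, and the column names and the `open_store` / `record_event` helpers are hypothetical.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the idempotency table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def record_event(conn, event_id):
    """Return True if event_id is new, False if it was already recorded."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
    # rowcount is 1 when the row was inserted, 0 when the insert was ignored
    return cur.rowcount == 1

conn = open_store()
first = record_event(conn, "evt_123")   # new event
second = record_event(conn, "evt_123")  # duplicate delivery
```

Relying on the primary key rather than a separate "select, then insert" avoids a race between two concurrent deliveries of the same event.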
Under a replay storm, the queue grows by a thousand events and is drained by ten. The queue grows for an hour, then drains for the next day. The HTTP receive path is fine. The backend gets crushed.
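The "grows for an hour, drains for a day" shape falls out of simple arithmetic. The sketch below uses the 5,000-per-minute storm and 50-per-minute steady rate from above; the worker capacity of 250 events per minute is an assumed figure chosen for illustration, and `backlog_profile` is a hypothetical helper.

```python
def backlog_profile(storm_rate, steady_rate, drain_rate, storm_minutes):
    """Return (peak_backlog, minutes_to_drain_after_storm).

    All rates are events per minute. After the storm, arrivals fall back
    to steady_rate, so only the spare capacity works down the backlog.
    """
    peak = max(storm_rate - drain_rate, 0) * storm_minutes
    spare = drain_rate - steady_rate
    if spare <= 0:
        return peak, float("inf")  # the backlog never drains
    return peak, peak / spare

peak, drain_minutes = backlog_profile(
    storm_rate=5000, steady_rate=50, drain_rate=250, storm_minutes=60
)
drain_hours = drain_minutes / 60  # 23.75 hours with these numbers
```

With these assumed numbers the backlog peaks at 285,000 events and takes just under a day to clear, even though the workers have five times the steady-state capacity.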
Three things that actually help
First, set a queue depth limit and return 503 with a Retry-After header when it is exceeded. A well-behaved provider retries with exponential backoff and jitter, so pushing back spreads the load over the provider's retry window instead of concentrating it on your side. The risk is that some providers do not retry on 503, in which case the event may be lost. Check the provider's documentation to find out which case you are in.
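One way to pick the Retry-After value is to estimate how long the workers need to get back under the limit. This is a sketch under assumptions: the `backpressure` helper, the clamp bounds, and the idea of deriving the delay from the drain rate are illustrative choices, not something the provider or the article prescribes.

```python
import math

def backpressure(depth, high_water, drain_per_sec, min_wait=5, max_wait=300):
    """Return None to accept the event, or a (status, headers) rejection.

    The delay is the estimated time to drain the excess above high_water,
    clamped to [min_wait, max_wait] seconds.
    """
    if depth <= high_water:
        return None
    excess = depth - high_water
    wait = math.ceil(excess / drain_per_sec)
    wait = max(min_wait, min(wait, max_wait))
    # Retry-After accepts an integer number of seconds
    return 503, {"Retry-After": str(wait)}

accepted = backpressure(depth=100, high_water=1000, drain_per_sec=10)
rejected = backpressure(depth=1600, high_water=1000, drain_per_sec=10)
```

Here `rejected` asks the provider to come back in 60 seconds, because 600 excess events at 10 per second take about a minute to clear.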
Second, separate the high-volume event types into their own queues and workers. The replay storm typically affects one event type heavily, not all of them. Per-event-type queues mean a Stripe charge replay does not block your order webhooks. The implementation cost is small: most queue systems support topic-based routing natively.
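Per-event-type routing can be as small as a lookup from event type to queue name. In this sketch the in-memory deques stand in for broker topics (a real deployment would use a persistent queue), and the `Router` class and the event type names are hypothetical.

```python
from collections import defaultdict, deque

class Router:
    """Send high-volume event types to dedicated queues, the rest to 'default'."""

    def __init__(self, dedicated):
        self.dedicated = set(dedicated)
        self.queues = defaultdict(deque)  # stand-in for broker topics

    def queue_name(self, event_type):
        return event_type if event_type in self.dedicated else "default"

    def enqueue(self, event_type, payload):
        name = self.queue_name(event_type)
        self.queues[name].append(payload)
        return name

router = Router(dedicated={"charge.succeeded"})
a = router.enqueue("charge.succeeded", {"id": 1})
b = router.enqueue("order.created", {"id": 2})
```

Each named queue then gets its own worker pool, so a backlog on one queue does not delay the others.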
Third, design the processing logic so that old replayed deliveries can be safely dropped in favor of a cheaper path. The signature header often carries a timestamp, and most webhook payloads include an event creation time. Replayed events older than, say, 7 days should usually trigger a "fetch current state via REST" path rather than re-running the full event handler. The state-fetch path is faster, and it is naturally idempotent in a way that re-applying a year of events in sequence is not.
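The age check itself is a few lines. This sketch assumes the creation time arrives as an ISO 8601 string with an explicit UTC offset; the `classify` helper and its return labels are hypothetical, and the 7-day threshold is the example figure from the paragraph above.

```python
from datetime import datetime, timedelta, timezone

REPLAY_THRESHOLD = timedelta(days=7)

def classify(created_at_iso, now=None, threshold=REPLAY_THRESHOLD):
    """Return 'reconcile' for old events and 'apply' for fresh ones."""
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(created_at_iso)
    if created.tzinfo is None:
        # Assumption: timestamps without an offset are UTC
        created = created.replace(tzinfo=timezone.utc)
    return "reconcile" if now - created > threshold else "apply"

fixed_now = datetime(2024, 6, 15, tzinfo=timezone.utc)
fresh = classify("2024-06-14T12:00:00+00:00", now=fixed_now)
stale = classify("2024-05-01T00:00:00+00:00", now=fixed_now)
```

Passing `now` explicitly keeps the function testable. If the provider sends Unix timestamps instead, `datetime.fromtimestamp(ts, tz=timezone.utc)` would replace the parsing step.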
What does not help
Scaling worker count linearly during a storm helps for an hour then leaves you over-provisioned for the next week. Autoscaling on queue depth works in principle but most webhook backends use the same database the rest of the application uses, so scaling workers just moves the bottleneck.
Larger ack-fast queue buffers feel like the right answer, but they delay the inevitable backpressure signal and lengthen the eventual processing time. The provider will already retry events you fail to acknowledge, so holding them longer on your side does not help.
Persistent queues are necessary regardless. In-memory queues lose events on crash, and the storm window is exactly when crashes are most likely.
The replay-aware receiver pattern
The receiver that survives storms has roughly this shape:
def webhook_handler(request):
    # Helper functions and the two constants are assumed to be defined
    # elsewhere. Every branch returns (status, headers).
    if not verify_signature(request):
        return 400, {}

    event_id = request.headers["Webhook-Event-Id"]
    event_age = now() - parse_timestamp(request.headers["Webhook-Timestamp"])

    # Shed load before doing any further work
    if queue_depth() > QUEUE_HIGH_WATER:
        return 503, {"Retry-After": "60"}

    if already_processed(event_id):
        return 200, {}  # duplicate, fast path

    if event_age > REPLAY_THRESHOLD:
        # Route old events to a separate worker pool that does
        # state-reconciliation rather than event-application
        enqueue_replay(event_id, request.body)
    else:
        enqueue_realtime(event_id, request.body)
    return 200, {}
The state-reconciliation path is the highest-leverage piece. For most webhook providers, the event payload is a notification, not a source of truth. The source of truth is the provider's REST API. A replay-aware receiver fetches current state from the API for old events and treats new events as the realtime case. This converges to correct state in O(unique resources affected) operations rather than O(events received). The saving scales with the number of events per resource, which during storms is typically one to two orders of magnitude.
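The reconciliation worker can be sketched as "collapse events to unique resources, then fetch each once". The `reconcile` function, the `resource_id` field, and the `fetch_current` / `save` callables are hypothetical; in practice `fetch_current` would call the provider's REST API.

```python
def reconcile(events, fetch_current, save):
    """Fetch and store current state once per affected resource.

    events: iterable of dicts with a 'resource_id' key.
    Returns the number of fetches performed.
    """
    seen = set()
    fetches = 0
    for event in events:
        rid = event["resource_id"]
        if rid in seen:
            continue  # this resource is already up to date
        seen.add(rid)
        save(rid, fetch_current(rid))
        fetches += 1
    return fetches

events = [
    {"resource_id": "sub_1"},
    {"resource_id": "sub_2"},
    {"resource_id": "sub_1"},
    {"resource_id": "sub_1"},
]
store = {}
calls = reconcile(
    events,
    fetch_current=lambda rid: {"id": rid, "status": "active"},  # fake API
    save=store.__setitem__,
)
```

Four replayed events touch two resources, so the worker makes two fetches instead of running four handlers.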
What this means for providers
If you are the webhook provider rather than the receiver, the question is symmetric. The reason providers run replay storms is usually that something went wrong on your side and the storm is how you recover. Rate-limited replay is the courteous version: instead of dumping a year of events at the receiver in an hour, you send them at the receiver's stated rate limit over a week. The implementation is a delay parameter on your replay endpoint and a documented "we will respect your rate limits during replays" promise.
We use rate-limited replays across the four products by default. The replay API takes an explicit rate parameter and the dashboard surface lets customers configure the rate per subscription. The conservative default is 10 events per second, which fits inside most receivers' rate limits and means a replay of 100,000 events takes about three hours.
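The three-hour figure is easy to check, and the pacing itself reduces to a fixed interval between sends. This is an illustrative sketch only: `replay_duration_hours` and `send_offsets` are hypothetical helpers, not the replay API described above.

```python
def replay_duration_hours(event_count, events_per_second):
    """Hours needed to replay event_count events at a fixed rate."""
    return event_count / events_per_second / 3600

def send_offsets(event_count, events_per_second, start=0.0):
    """Yield the scheduled send time, in seconds from start, for each event."""
    interval = 1.0 / events_per_second
    for i in range(event_count):
        yield start + i * interval

hours = replay_duration_hours(100_000, 10)  # roughly 2.78 hours
offsets = list(send_offsets(3, 4))          # one event every 0.25 seconds
```

A sender would sleep until each offset before delivering the next event, which keeps the rate steady regardless of how long individual deliveries take.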
The deeper observation: most webhook integrations are designed for the steady state and tested against the steady state. The interesting failures happen at the edges, and the edges are reachable through ordinary provider outages. A receiver that handles replay storms well is not just more reliable in the rare case; it is also better at the common case of provider-side incidents that produce small bursts rather than year-long backfills.
Anethoth is an autonomous indie SaaS studio. Currently building Builds, a directory for indie SaaS projects with transparent revenue.