Designing API Webhook Event Ordering Guarantees: What to Promise and What to Document Away
Webhook ordering is one of the more politically charged corners of API design because customer integrations frequently assume ordering they have not actually been promised. A common pattern: a customer builds a state machine that processes events as if their arrival order matched the order of the underlying state changes. The integration works correctly for months, until a network blip produces an out-of-order delivery that breaks the state machine in a way that is hard to diagnose.
The honest engineering answer is that no widely used webhook system reliably preserves order across retries, parallelism, recovery, and the natural network variance of HTTP delivery. Promising strict ordering produces support escalations the first time the promise is broken. Refusing to engage with ordering at all produces customer integrations that fail in ways the customer cannot debug. The right answer is more careful than either, and the design choices compound enough that getting them wrong early is expensive to fix later.
The three granularities
Webhook providers can in principle promise three different ordering granularities. Strict global ordering says every event arrives in the order it was emitted across the entire provider. Per-resource ordering says events about a particular resource (an order, a subscription, a user) arrive in their emission order, but events about different resources can interleave. Per-account ordering sits between these and is rarely what customers actually need but is occasionally what providers attempt to promise.
In practice, almost no provider can deliver strict global ordering at scale, because the parallelism of webhook delivery is incompatible with serializing every event behind every other event. Per-resource ordering is achievable in theory if the provider keeps per-resource queues and refuses to deliver event N+1 until event N has been acked, but the cost of head-of-line blocking on a slow receiver is substantial, and providers that promise per-resource ordering often soften the promise into "best effort" the first time a slow receiver costs them throughput.
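To make the head-of-line blocking cost concrete, here is a minimal in-memory sketch of per-resource queues. The class and method names (PerResourceDispatcher, enqueue, deliverable, ack) are hypothetical and chosen for illustration; a real provider would back this with durable storage and delivery workers.

```python
from collections import defaultdict, deque


class PerResourceDispatcher:
    """Illustrative only: one FIFO queue per resource, head must be acked first."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, resource_id, event_id):
        # Events for the same resource line up behind each other.
        self._queues[resource_id].append(event_id)

    def deliverable(self):
        # Only the head of each non-empty queue may be sent; everything
        # behind it is blocked until the head is acknowledged.
        return {rid: q[0] for rid, q in self._queues.items() if q}

    def ack(self, resource_id, event_id):
        q = self._queues.get(resource_id)
        if not q or q[0] != event_id:
            raise ValueError("only the head of a resource queue can be acked")
        q.popleft()


dispatcher = PerResourceDispatcher()
dispatcher.enqueue("order-1", "e1")
dispatcher.enqueue("order-1", "e2")
dispatcher.enqueue("order-2", "e3")
# e2 is not deliverable while e1 is unacknowledged, however long that takes.
print(dispatcher.deliverable())  # {'order-1': 'e1', 'order-2': 'e3'}
```

Different resources proceed independently, but a receiver that never acknowledges e1 stalls every later event for order-1, which is the throughput cost that pushes providers toward "best effort".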
The honest position most providers should take is that no ordering guarantee is offered, and customers should build their integrations assuming out-of-order arrival. The position is unpopular with customers in the short term and correct in the long term.
The retry interaction
Any retry strategy that does not stall subsequent events when an earlier event fails will break ordering. The two reasonable options are stall-on-failure (which protects ordering at the cost of cascading delays during partial outages) and continue-on-failure (which protects throughput at the cost of ordering). Most providers choose continue-on-failure because the alternative makes the service look broken to customers whose receivers are intermittently slow.
Retries break ordering even when every event is eventually delivered: an event that is retried after a brief receiver outage arrives after events that were emitted later but succeeded on their first attempt. For example, if event 2 fails its first attempt and succeeds on a later retry, the customer's receiver can see events in the order 1, 3, 4, 5, 2.
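The reordering can be reproduced with a small simulation of continue-on-failure delivery. This is a sketch under simplified assumptions: events are sent at a fixed interval, a failed first attempt is retried once after a fixed delay, and the retry succeeds. The function name and parameters are hypothetical.

```python
def arrival_order(events, fail_first_attempt, send_interval=1.0, retry_delay=5.0):
    """Return event IDs in the order their delivery succeeds.

    events: IDs in emission order, one sent every `send_interval` seconds.
    fail_first_attempt: IDs whose first attempt fails; the retry is assumed
    to succeed `retry_delay` seconds after the first attempt.
    """
    deliveries = []
    for index, event_id in enumerate(events):
        sent_at = index * send_interval
        if event_id in fail_first_attempt:
            delivered_at = sent_at + retry_delay
        else:
            delivered_at = sent_at
        # The index breaks ties so that equal times keep emission order.
        deliveries.append((delivered_at, index, event_id))
    deliveries.sort()
    return [event_id for _, _, event_id in deliveries]


print(arrival_order([1, 2, 3, 4, 5], fail_first_attempt={2}))  # [1, 3, 4, 5, 2]
```

A shorter retry delay only changes where the retried event lands, not whether it is displaced: with retry_delay=1.5 the same call returns [1, 3, 2, 4, 5].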
What customers should build instead
The right customer-side pattern is to treat events as state-transition notifications and resolve actual state from the resource API rather than from event order. The pattern is:
- Receive event with stable event ID and resource ID.
- Check idempotency table for event ID; if seen, ack and skip.
- GET the resource from the provider API to get current state.
- Reconcile current state with local state.
- Record event ID in idempotency table inside the same transaction as the state update.
The pattern works for any ordering, including reverse order or no order. It costs one API round trip per event but eliminates the entire class of order-dependent bugs. Most large webhook integrations end up at this pattern eventually; the question is whether they arrive at it before or after their first ordering incident.
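The steps above can be sketched with Python's built-in sqlite3 module. This is an illustrative receiver, not a reference implementation: the table names, the event field names (event_id, resource_id), the status column, and the fetch_resource callable (standing in for a GET against the provider's resource API) are all assumptions made for the example.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: an idempotency table plus the local resource state.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resources "
        "(resource_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()


def handle_event(conn, event, fetch_resource):
    """Process one webhook event; returns 'duplicate' or 'processed'."""
    seen = conn.execute(
        "SELECT 1 FROM processed_events WHERE event_id = ?", (event["event_id"],)
    ).fetchone()
    if seen:
        return "duplicate"  # already handled: acknowledge and skip

    # Resolve current state from the API instead of trusting event order.
    current = fetch_resource(event["resource_id"])

    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT OR REPLACE INTO resources (resource_id, status) VALUES (?, ?)",
                (event["resource_id"], current["status"]),
            )
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
    except sqlite3.IntegrityError:
        # A concurrent delivery of the same event recorded it first.
        return "duplicate"
    return "processed"
```

Because the handler writes whatever the API currently reports, delivering "created" after "shipped" leaves the local record at the API's current status rather than regressing it. The primary key on event_id also covers the race where two deliveries of the same event pass the initial check at the same time.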
What providers can do to help
Three things help customers build the right integration without forcing the conversation. First, every event includes a server-side timestamp that customers can use to detect out-of-order arrival even if they cannot prevent it. Second, every event includes a stable event ID and clear documentation about idempotency. Third, the provider explicitly documents the lack of ordering guarantee with examples of the failure mode, rather than leaving it implicit.
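As an illustration of the first point, a receiver can use the timestamp to notice out-of-order arrival per resource. The sketch below assumes the timestamp is an ISO 8601 string with an explicit UTC offset; the class name and method are hypothetical, and the check only detects reordering, it does not correct it.

```python
from datetime import datetime


class OrderingMonitor:
    """Illustrative only: flags events older than the newest one seen per resource."""

    def __init__(self):
        self._latest = {}

    def observe(self, resource_id, occurred_at):
        """Return True if this event arrived after a newer event for the resource."""
        timestamp = datetime.fromisoformat(occurred_at)
        latest = self._latest.get(resource_id)
        if latest is not None and timestamp < latest:
            return True  # out of order: worth logging or counting as a metric
        self._latest[resource_id] = timestamp
        return False


monitor = OrderingMonitor()
monitor.observe("sub-1", "2024-01-01T00:00:05+00:00")  # first event for sub-1
late = monitor.observe("sub-1", "2024-01-01T00:00:02+00:00")  # older timestamp
print(late)  # True
```

Two events can carry the same timestamp, so this works as a diagnostic signal rather than as a sequencing mechanism. Resolving state from the resource API is still what makes the integration correct.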
The opposite anti-pattern is the provider that silently relies on wire order usually matching emit order and documents the lack of a guarantee only deep in an API reference that customers never read. This works for most customers most of the time, then produces support escalations from the few customers whose integrations assumed ordering and finally hit a case where it broke.
What we do
Across DocuMint, CronPing, FlagBit, and WebhookVault, we explicitly document that webhook event ordering is not guaranteed. We include a server-side occurred_at timestamp and a stable event_id on every event. The documentation includes a paragraph on the right integration pattern (idempotency by event ID, resolve state from API) and a paragraph on the failure mode (retry-after-blip produces out-of-order arrival).
WebhookVault's role as a webhook capture-and-inspect product means we end up seeing the receiving side of many integrations, and the pattern is consistent: customers who built order-dependent state machines have intermittent integration failures, and customers who built event-as-trigger-plus-state-from-API integrations work reliably. The pattern is independent of which provider's webhooks they are receiving.
The deeper trade-off
Webhook ordering is one of the cleaner cases where the cost of getting the abstraction wrong falls primarily on the consumer rather than the provider. Providers that promise ordering they cannot deliver pay for it in incident response. Providers that refuse to discuss ordering pay for it in customer churn. The honest middle (explicitly no ordering, with documentation and event metadata that let customers build the right pattern) is the option that scales.
The pattern recurs in distributed systems: any promise that conflicts with horizontal scaling and partial failure recovery is a promise that will eventually be broken under load. The reliable promises are the ones that survive the architecture rather than the ones that constrain it.
Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.