Designing API Webhook Idempotency on the Receiver Side: The Contract Providers Push to Customers

Webhook providers cannot guarantee exactly-once delivery, so they push idempotency onto the receiver. The pattern is well-known to providers and consistently underdocumented for receivers. This article covers what receivers actually need to do, and the failure modes when they skip it.

Webhook providers converge on a contract that includes at-least-once delivery and explicit non-guarantees around ordering, exactly-once semantics, and timely delivery. The contract is shaped by the underlying physics of HTTP delivery to unreliable receivers across the public internet: providers cannot guarantee exactly-once delivery without holding events indefinitely on failure, which violates the timeliness customers want; providers cannot guarantee ordering across retries without serializing all delivery, which violates the throughput customers want; providers cannot guarantee delivery within any specific time window without dropping events on persistent failure, which violates the durability customers want.

The consequence is that webhook providers push idempotency onto the receiver. The provider sends each event with a stable identifier, and the receiver is responsible for using that identifier to deduplicate. The contract is well-known on the provider side and is documented in the integration guides of every major webhook-emitting product. The contract is consistently underdocumented for the receiving side, with most provider integration guides giving a single sentence about idempotency keys and then moving on. The receiver-side pattern is the actual integration work that determines whether the customer's integration is robust or fragile, and the gap between the documentation and the work is one of the most common sources of webhook-related production bugs.

What the receiver needs to do

The minimum viable receiver-side idempotency pattern is a processed_events table with a unique constraint on the provider's event ID. The schema is small: id (the provider event ID), received_at (when the receiver first saw it), processed_at (when the receiver finished processing it), result_summary (a small text field capturing what happened). The processing flow is insert-then-process: the receiver attempts to insert the event ID into processed_events with the received_at timestamp, and if the insert fails because of a unique constraint violation, the receiver treats the event as a duplicate and acknowledges it without reprocessing.
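The insert-then-process flow above can be sketched with Python's standard sqlite3 module. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names (claim_event, handle_delivery, now_iso) and the process callback are hypothetical, and SQLite stands in for whatever database the receiver actually uses.

```python
import sqlite3
from datetime import datetime, timezone

# Columns follow the schema described above. PRIMARY KEY gives the
# database-level unique constraint on the provider's event ID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_events (
    id             TEXT PRIMARY KEY,
    received_at    TEXT NOT NULL,
    processed_at   TEXT,
    result_summary TEXT
)
"""


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True on the first delivery of event_id, False on a duplicate."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO processed_events (id, received_at) VALUES (?, ?)",
                (event_id, now_iso()),
            )
        return True
    except sqlite3.IntegrityError:
        # Unique constraint violation: this event ID was already recorded.
        return False


def handle_delivery(conn: sqlite3.Connection, event_id: str, process) -> str:
    """`process` is a hypothetical callback that does the work and returns a summary."""
    if not claim_event(conn, event_id):
        return "duplicate"  # acknowledge without reprocessing
    summary = process()
    with conn:
        conn.execute(
            "UPDATE processed_events SET processed_at = ?, result_summary = ? WHERE id = ?",
            (now_iso(), summary, event_id),
        )
    return "processed"
```

In this sketch a crash between the claim and the processing leaves a row whose processed_at is NULL, and a later retry of that event would be treated as a duplicate. Tying the record and the side effects together closes that gap.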

The pattern depends on the unique constraint being enforced at the database level rather than at the application level. Application-level checks introduce a race condition where two concurrent webhook deliveries can both check, find no existing record, and proceed to process. The database-level constraint is atomic; one of the inserts succeeds and the other fails with a constraint violation that the receiver can catch and treat as the duplicate signal. It is one of the cleanest cases for using a database constraint as application-level coordination machinery, and teams that try to optimize the constraint away usually introduce bugs they later have to debug.

The transactional integration question

The receiver-side idempotency record and the actual side effects of processing the event need to be tied together in a way that the failure modes do not produce partial state. The textbook pattern is to wrap the insert into processed_events and the side effects in the same database transaction, so that either both succeed or both fail. The pattern works when the side effects are database operations within the same database; it does not work when the side effects include external API calls or operations on other databases.

The mixed-side-effect case is the harder one. The pattern that scales is the transactional outbox: the receiver inserts the idempotency record and an outbox row in the same transaction, and a separate worker reads from the outbox and performs the external side effects. The outbox worker is itself idempotent, using the outbox row ID as the key. The pattern produces effectively-once external side effects despite at-least-once delivery from the provider and at-least-once delivery from the outbox to the side effect, and is the standard recipe for webhook receivers that need to integrate with external systems.

The ack-fast pattern

The receiver should acknowledge the webhook delivery as fast as possible and defer actual processing to an asynchronous worker. The reason is that webhook providers typically have an aggressive timeout for delivery acknowledgement, often in the range of 10-30 seconds, and many receiver-side processing operations can take longer. A receiver that acks slowly forces the provider to time out and retry, which produces duplicate deliveries that the idempotency layer has to catch.

The ack-fast pattern is: receive the webhook, validate the signature, insert into the idempotency-and-queue table, return 2xx. The actual processing happens in a separate worker that reads from the queue. The pattern produces fast ack times measured in tens of milliseconds and decouples the processing from the provider's timeout window. The pattern also has the side benefit that processing failures do not produce retry storms from the provider; the provider sees a successful ack, the processing failure is handled internally by the receiver's worker with its own retry policy, and the provider does not need to know about the failure.

The schema evolution question

Webhook payloads evolve over time. New fields are added, optional fields become required, occasionally fields are deprecated. The receiver-side pattern that survives schema evolution treats the payload as untrusted data and only depends on the small set of fields that are part of the documented stable schema. The stable fields typically include the event ID, the event type, the timestamp, and a reference to the resource that changed. The receiver should use these stable fields to identify the event and look up the resource through the provider's REST API for the current state rather than depending on the full payload for resource details.

The fetch-current-state-on-event pattern is the right default for most webhook integrations because it converts schema-evolution risk into a quota-and-latency concern that is easier to manage. The trade-off is that the pattern requires an additional API call per event, which consumes API quota in proportion to webhook volume and adds latency to processing. The pattern is wrong when the event payload includes information that is not in the current state, such as the before-state for a state-change event; in those cases the receiver has to use the payload and accept the schema evolution risk.
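A sketch of the envelope-plus-fetch approach. The field names ("id", "type", "created", "resource_id") are hypothetical placeholders for whatever a given provider documents as stable, and fetch_resource is an injected stand-in for a call to the provider's REST API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Tuple


@dataclass(frozen=True)
class EventEnvelope:
    """The small set of stable fields the receiver depends on."""
    event_id: str
    event_type: str
    created: str
    resource_id: str


def parse_envelope(payload: Mapping[str, Any]) -> EventEnvelope:
    """Read only the stable fields and ignore everything else in the payload."""
    try:
        return EventEnvelope(
            event_id=str(payload["id"]),
            event_type=str(payload["type"]),
            created=str(payload["created"]),
            resource_id=str(payload["resource_id"]),
        )
    except KeyError as exc:
        raise ValueError(f"payload is missing stable field {exc}") from exc


def handle_event(
    payload: Mapping[str, Any],
    fetch_resource: Callable[[str], Mapping[str, Any]],
) -> Tuple[EventEnvelope, Mapping[str, Any]]:
    envelope = parse_envelope(payload)
    # Resource details come from the API, not from the payload, so new or
    # renamed payload fields do not affect the receiver.
    current = fetch_resource(envelope.resource_id)
    return envelope, current
```

Because the envelope parser never touches unknown keys, a payload that gains fields parses exactly as before; only the removal of a stable field surfaces as an error.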

The error handling question

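The receiver needs to distinguish between transient and permanent processing failures. Transient failures should produce a non-2xx response so the provider retries; permanent failures should produce a 2xx response and a separate alerting mechanism so the provider does not waste retry budget on events that cannot succeed. The classification is application-specific, but the general rule is that infrastructure failures (database connection failures, external API failures, and timeouts) are transient, while data validation failures, business logic errors, and authorization failures are permanent. In an ack-fast receiver, the response code can only reflect failures that occur before the ack, such as being unable to persist the event; failures inside the asynchronous worker apply the same classification to the worker's own retry policy, retrying transient failures internally and routing permanent ones to alerting.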
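The mapping from failure class to response code can be sketched as below, for whatever work the receiver does before responding. The exception classes, the respond function, and the record_failure callback are hypothetical; treating unclassified exceptions as retryable is an assumption of this sketch, not a rule from any provider.

```python
class TransientError(Exception):
    """Infrastructure trouble: retrying later may succeed."""


class PermanentError(Exception):
    """Validation, business-logic, or authorization failure: retrying cannot help."""


def respond(process, event, record_failure) -> int:
    """Run `process(event)` and return the HTTP status to send to the provider."""
    try:
        process(event)
    except PermanentError as exc:
        # Ack so the provider stops retrying, and surface the failure to an
        # internal alerting or triage mechanism instead.
        record_failure(event, exc)
        return 200
    except (TransientError, ConnectionError, TimeoutError):
        # Non-2xx asks the provider to redeliver later.
        return 503
    except Exception:
        # Assumption: anything unclassified is treated as retryable.
        return 500
    return 200
```

The important asymmetry is that a permanent failure is acknowledged but never silently dropped: the 2xx response and the call to record_failure always travel together.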

The permanent-failure-without-retry pattern requires a separate visibility surface so the receiver-side operator can investigate. The pattern is typically an internal dashboard that lists permanent-failed events with the error details, the event payload, and a retry button for cases where the underlying issue has been fixed and the event should be reprocessed. The pattern is the receiver-side mirror image of the provider-side webhook delivery dashboard, and the operational value is similar: it converts a class of incident from urgent investigation into routine triage.

Three patterns that fail

The first pattern that fails is processing-without-idempotency. The pattern assumes the provider will not deliver duplicates, which is wrong for every webhook provider. The failure mode is double-processing during retry storms, which produces double charges, duplicate inventory adjustments, double notifications, and other side effects that are visible to end users and expensive to clean up. The pattern is most common in early-stage integrations where the team has not yet seen a retry storm and is most expensive when the storm finally arrives.

The second pattern that fails is ack-after-processing. The pattern combines synchronous processing with delivery acknowledgement, so the provider waits for processing to complete before considering the delivery successful. The pattern produces slow acks during processing spikes, which produces timeouts, which produces provider retries, which produces duplicate processing because the original processing eventually succeeded but the ack arrived late. The pattern compounds: the duplicate processing increases the load that caused the slow acks, the slow acks produce more retries, and the receiver enters a doom loop that the provider does not know about.

The third pattern that fails is application-level idempotency checks. The receiver selects first and inserts only if no row is found, rather than inserting and handling the conflict, which introduces a race condition between concurrent webhook deliveries. The race condition is invisible at low webhook volumes and visible at high webhook volumes, which means it manifests during traffic spikes or during retry storms, exactly when duplicate handling matters most. The fix is the database-level unique constraint pattern, which is atomic and does not have the race condition.

Our use across DocuMint, CronPing, FlagBit, and WebhookVault

The four products are mostly webhook-emitting rather than webhook-receiving, with the exception of DocuMint which receives Stripe webhooks for billing events. The DocuMint receiver follows the patterns described in this article: a processed_stripe_events table with the Stripe event ID as a unique constraint, the transactional outbox pattern for cases where the event triggers external side effects, the ack-fast pattern with the actual processing happening in a separate worker, and the fetch-current-state-via-Stripe-API pattern for cases where the receiver needs more detail than the event payload provides.

The pattern has been in place since DocuMint launched and has caught duplicate Stripe deliveries during retry windows on multiple occasions, including one case where a Stripe-side issue produced sustained duplicate deliveries over a 30-minute window. The duplicate-handling worked transparently and the customer-facing billing was correct without operator intervention. The case is the kind of operational evidence that justifies the upfront investment in idempotency machinery; the cost of building it is small, the cost of not having it during the first incident is large, and the cost ratio is the kind of asymmetric situation where the right call is to invest in the prevention rather than the recovery.

WebhookVault is the product that most directly cares about receiver-side patterns because the product is a webhook receiver for customers' upstream integrations, and the patterns we recommend for receivers are the patterns we implement internally. The dogfooding alignment between what we recommend and what we build is part of how WebhookVault stays useful as a webhook-debugging product: the implementation choices we make are the same choices we recommend customers make, and the patterns we test in production are the ones we are most confident recommending.


Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.