The textbook discussion of webhook delivery starts with the three guarantees: at-most-once, at-least-once, exactly-once. The textbook is right that exactly-once is impossible in the strict sense and that at-least-once is the only practical option. It is misleading about what at-least-once means in practice.
At-least-once does not mean "we will eventually deliver this." It means "we will keep trying for some bounded time, and after that, we will give up, and we may have given up after the receiver successfully processed the message but before they told us." The receiver, in turn, will sometimes acknowledge a message they did not actually process, will sometimes process a message twice because the acknowledgment got lost, and will occasionally drop messages because their disk is full. The system as a whole tells small lies at every layer. The work of building reliable webhook delivery is choosing which lies are tolerable and adding redundancy where they are not.
What can go wrong
Start with the failure surface. A webhook delivery, end to end, can fail at:
- Sender's queue (event written to memory but not yet to durable storage when the process crashes)
- Sender's outbound HTTP (DNS, TLS handshake, TCP, route)
- Receiver's inbound HTTP (firewall, load balancer, process accept queue)
- Receiver's processing (the application code that handles the request)
- Receiver's response back to sender (network, sender's read timeout)
- Sender's acknowledgment processing (recording that delivery succeeded)
Of these six points, only two are easy: the application processing in the receiver, where the code is yours, and the sender's queue, where you control the durability story. The other four either cross the network or depend on what came back across it, which means any of them can lie. The sender's outbound HTTP can succeed while the receiver's inbound HTTP never sees the request. The receiver can process the request and the response can be lost in transit. The sender can record success and the receiver can record failure, or vice versa.
The persistent retry queue
The first design decision is making delivery durable. The event is written to a queue (a database table is fine; you do not need Kafka) and the queue is the source of truth for what has and has not been delivered. The dispatcher reads pending events, attempts delivery, and updates the row based on the outcome. The dispatcher is replaceable: if it crashes, a new one comes up, reads the unfinished events, and resumes.
The queue table is something like (id, event_id, target_url, payload, attempts, next_attempt_at, status). The status is one of pending, delivered, permanently_failed. A row stays in the table after delivery for a retention window (a week is reasonable) so that customers who suspect a delivery problem can be shown the trace.
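A minimal sketch of that table and the dispatcher's read path, using SQLite from the Python standard library. The table name webhook_queue and the helper names enqueue, due_events, and mark_delivered are hypothetical, and the sketch assumes a single dispatcher process; several dispatchers would need some way to claim rows so that two of them do not attempt the same delivery.

```python
import sqlite3
import time

# Column names follow the tuple in the text; the table name is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_queue (
    id              INTEGER PRIMARY KEY,
    event_id        TEXT NOT NULL,
    target_url      TEXT NOT NULL,
    payload         TEXT NOT NULL,
    attempts        INTEGER NOT NULL DEFAULT 0,
    next_attempt_at REAL NOT NULL,
    status          TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'delivered', 'permanently_failed'))
);
"""


def enqueue(conn, event_id, target_url, payload, now=None):
    """Durably record an event; it becomes due for delivery immediately."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO webhook_queue (event_id, target_url, payload, next_attempt_at) "
            "VALUES (?, ?, ?, ?)",
            (event_id, target_url, payload, now),
        )


def due_events(conn, now=None, limit=100):
    """Return pending rows whose next attempt time has passed, oldest first."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT id, event_id, target_url, payload, attempts FROM webhook_queue "
        "WHERE status = 'pending' AND next_attempt_at <= ? "
        "ORDER BY next_attempt_at LIMIT ?",
        (now, limit),
    ).fetchall()


def mark_delivered(conn, row_id):
    """Record a successful attempt; the row stays in the table for the retention window."""
    with conn:
        conn.execute(
            "UPDATE webhook_queue SET status = 'delivered', attempts = attempts + 1 "
            "WHERE id = ?",
            (row_id,),
        )
```

Because all state lives in the table, a restarted dispatcher only has to call due_events again to pick up where the previous one stopped.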
Retries and backoff
When delivery fails, the question is whether to retry and when. The answer depends on the failure type. Network errors and 5xx responses are retried; 4xx responses (the receiver is saying the request itself is at fault) are not. The line between them is sometimes blurry: a 401 might be a temporarily-rotated key, a 429 is an explicit retry signal, a 500 might be a permanent bug or a transient outage. The default is to treat 4xx as permanent and 5xx as transient, with 429 and 408 as explicit retries.
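The default rule can be written as a small classifier. This is a sketch: the function name classify and the three outcome strings are hypothetical, a status of None stands for a network error with no HTTP response, and treating anything outside 2xx, 4xx, and 5xx (an unfollowed redirect, for example) as permanent is an assumption the text does not cover.

```python
from typing import Optional

# 4xx codes that are explicit "try again" signals rather than permanent failures.
RETRYABLE_4XX = {408, 429}


def classify(status: Optional[int]) -> str:
    """Map a delivery outcome to 'delivered', 'retry', or 'permanent'.

    status is the HTTP status code, or None when no response was received
    (DNS failure, connection reset, timeout, and so on).
    """
    if status is None:
        return "retry"          # network error: transient by default
    if 200 <= status < 300:
        return "delivered"
    if status in RETRYABLE_4XX:
        return "retry"          # 408 and 429 are the 4xx exceptions
    if 400 <= status < 500:
        return "permanent"      # receiver rejected the request itself
    if 500 <= status < 600:
        return "retry"          # receiver-side failure: transient by default
    return "permanent"          # assumption: anything else is not retried
```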
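Backoff is exponential with full jitter. The wait between attempt N and attempt N+1 is a random number between zero and min(cap, base * 2^N). The cap is one hour. The base is one second. Because each wait is drawn uniformly from the whole window rather than clustered at its upper edge, retries from many failing deliveries are spread out instead of arriving together, which prevents thundering-herd retries when an outage ends. With a one-second base, the window reaches the hour-wide cap only at about the twelfth consecutive failure, since 2^12 seconds is the first doubling that exceeds an hour.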
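The formula translates directly into code. In this sketch the names backoff_ceiling and full_jitter_delay are hypothetical, the constants are the base and cap given above, and the exponent clamp is an added safety measure rather than part of the formula.

```python
import random

BASE_SECONDS = 1.0     # base from the text
CAP_SECONDS = 3600.0   # cap from the text: one hour


def backoff_ceiling(attempt, base=BASE_SECONDS, cap=CAP_SECONDS):
    """Upper bound of the wait after `attempt` failures: min(cap, base * 2^attempt)."""
    # Clamp the exponent so an absurdly large attempt count cannot overflow
    # the float multiplication; 2^64 seconds is far beyond any sane cap.
    exponent = min(attempt, 64)
    return min(cap, base * 2 ** exponent)


def full_jitter_delay(attempt, base=BASE_SECONDS, cap=CAP_SECONDS, rng=random):
    """Full jitter: a uniform random wait between zero and the ceiling."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))
```

Passing a seeded random.Random instance as rng makes the schedule reproducible in tests.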
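The retry budget is the total number of attempts, which is finite. Twelve attempts is a reasonable default, but the schedule has to be sized to the time span you want to cover: with a one-second base and a one-hour cap, the ceilings on the eleven waits between twelve attempts sum to roughly an hour, so spreading twelve attempts over six days requires a much larger base and cap. After the budget is exhausted, the row moves to permanently_failed, and an alert goes to a human.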
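The budget check is a few lines of bookkeeping after each failed attempt. The sketch below keeps the row in memory to stay self-contained; the Delivery class and the record_failure function are hypothetical names, and the delay argument is whatever the backoff policy produced for this attempt.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 12  # retry budget from the text


@dataclass
class Delivery:
    """In-memory stand-in for one queue row."""
    attempts: int = 0
    status: str = "pending"
    next_attempt_at: float = 0.0


def record_failure(delivery, now, delay, max_attempts=MAX_ATTEMPTS):
    """Count a failed attempt and either reschedule or give up.

    Returns True when the budget is exhausted, which is the caller's cue
    to alert a human.
    """
    delivery.attempts += 1
    if delivery.attempts >= max_attempts:
        delivery.status = "permanently_failed"
        return True
    delivery.next_attempt_at = now + delay
    return False
```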
Idempotency at the receiver
The receiver has to assume that any webhook may arrive more than once. The fix is idempotency keyed on the event ID, not the request ID. Each webhook payload includes a stable event ID (UUID, a database PK, anything unique to the underlying event). The receiver records the IDs it has processed and refuses to process the same one twice.
The naive implementation is a unique constraint on the event ID column in some processed-events table. Insert; if the unique constraint fires, the event was already processed, return 200. The implementation has to be careful about transactions: the unique-constraint insert and the actual side effects of the event have to be in the same transaction, or you can record processed without doing the work, or vice versa.
For receivers that cannot make their side effects transactional (sending an email, calling an external API), the pattern is the outbox: process the event, write the side effects to a local queue, return 200. A separate worker drains the local queue. The visible effect (the email being sent) is now decoupled from the webhook acknowledgment, and an outbox row that has not been processed is a clear signal that something needs attention.
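The outbox can be sketched the same way. The table layout and the names setup_outbox, receive, and drain are hypothetical; perform stands for whatever function actually sends the email or calls the external API. Note that a crash between perform returning and the row being marked done repeats the side effect on the next drain, so the worker itself is at-least-once.

```python
import json
import sqlite3


def setup_outbox(conn):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS outbox (
            id       INTEGER PRIMARY KEY,
            event_id TEXT NOT NULL,
            action   TEXT NOT NULL,
            body     TEXT NOT NULL,
            done     INTEGER NOT NULL DEFAULT 0
        );
    """)


def receive(conn, event_id, action, body):
    """Webhook handler: dedup and queue the side effect in one transaction.

    Returns True if the work was queued, False for a duplicate event.
    The handler answers 200 in both cases.
    """
    with conn:
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False
        conn.execute(
            "INSERT INTO outbox (event_id, action, body) VALUES (?, ?, ?)",
            (event_id, action, json.dumps(body)),
        )
    return True


def drain(conn, perform):
    """Worker: run each pending side effect, marking it done only afterwards."""
    rows = conn.execute(
        "SELECT id, action, body FROM outbox WHERE done = 0 ORDER BY id"
    ).fetchall()
    completed = 0
    for row_id, action, body in rows:
        perform(action, json.loads(body))  # may raise; the row then stays pending
        with conn:
            conn.execute("UPDATE outbox SET done = 1 WHERE id = ?", (row_id,))
        completed += 1
    return completed
```

Rows with done = 0 that are older than the worker's normal latency are the signal that something needs attention.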
Out-of-order delivery
Webhooks arrive in roughly the order they were generated, but not exactly. Two events generated 50ms apart on the sender can arrive in the opposite order at the receiver, especially after retries scatter them. This is sometimes catastrophic: if a "subscription created" event and the "subscription cancelled" event that followed it arrive in the opposite order and are applied as they arrive, the receiver ends up thinking the subscription is active.
The fix is for events to carry a timestamp from the sender's clock and for the receiver to apply them in timestamp order. If a "cancelled" event arrives before the "created" event it refers to, the receiver buffers the cancellation until the creation arrives, then applies them in order. This works only when the receiver is willing to wait briefly: webhooks that arrive seconds apart are fine, while webhooks that arrive minutes apart are edge cases a short buffer will not cover. For genuinely out-of-order traffic, the answer is to model each event as a state snapshot rather than an operation: every event carries the resource's full state, and the receiver keeps whichever is most recent by sender timestamp.
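The last approach, where every event carries the full state and the receiver keeps only the most recent, needs very little code. This is a sketch with hypothetical names (Snapshot, Store, sent_at); it keeps state in memory and assumes the sender's timestamps for a given resource are strictly increasing.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    resource_id: str
    sent_at: float   # timestamp from the sender's clock
    state: dict      # the resource's full state at that time


class Store:
    """Keeps the newest snapshot per resource, whatever the arrival order."""

    def __init__(self):
        self._latest = {}

    def apply(self, snap):
        """Apply a snapshot; return False if it is stale or a duplicate."""
        current = self._latest.get(snap.resource_id)
        if current is not None and current.sent_at >= snap.sent_at:
            return False  # something at least as new is already stored
        self._latest[snap.resource_id] = snap
        return True

    def state(self, resource_id):
        snap = self._latest.get(resource_id)
        return None if snap is None else snap.state
```

With this model a late "created" snapshot is simply ignored once the "cancelled" snapshot has been stored, so no buffering is needed.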
Verifying the truth
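The last layer is verification. The sender signs every payload with a shared secret (HMAC-SHA256 over the body and timestamp), and the receiver checks the signature before processing. This is not really a delivery problem, but it interacts with one: a forged request delivers a webhook the sender never sent, and a replayed request re-delivers one it did send. The signature catches the forgery; the timestamp window (reject anything older than five minutes) catches replays, including retries that travel through an attacker-controlled cache. Because legitimate retries can happen long after the event was created, the signed timestamp has to be the time of the delivery attempt, not the time of the event, or the window will reject the sender's own retries.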
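A sketch of signing and verification with the Python standard library. The message layout (timestamp, a dot, then the raw body) and the function names sign and verify are assumptions; the text only specifies HMAC-SHA256 over the body and timestamp and a five-minute window.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # five-minute window from the text


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>' (layout is an assumption)."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, now=None,
           tolerance=TOLERANCE_SECONDS) -> bool:
    """Accept only a fresh request whose signature matches."""
    now = time.time() if now is None else now
    if now - timestamp > tolerance:
        return False  # too old: a replay or a badly delayed delivery
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

The receiver has to verify against the raw request bytes; re-serializing parsed JSON before checking can change the body and break the signature.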
The signature also catches misrouting. If a webhook ends up at the wrong URL because of misconfiguration, the signature check will fail, provided each endpoint has its own secret, and the event will be dropped. Without the signature, the wrong receiver might happily process events for a different tenant.
The honest summary
Reliable webhook delivery is a layered defense, not a single mechanism. The sender uses a durable queue and exponential backoff. The receiver uses idempotency keys and an outbox. Both sides timestamp their work and accept that the network occasionally lies. The system as a whole tells smaller lies than any single component.
This is the architecture behind WebhookVault: a durable capture queue, exponential retry on forwarding failures, signature verification on the inbound side, and full request-response logs that let customers investigate which lies the network told today. The goal is not to prevent lies; the goal is to make them legible.