Designing Webhooks That Customers Actually Trust
The webhook is the most fragile API surface most products ship. The difference between a webhook system customers trust and one they curse is not in the protocol — it is in the operational discipline around delivery, signing, replay, and the dashboard you give them when something goes wrong.
Unlike an ordinary API request, which the client makes and gets a response to, a webhook is a request the server makes that may or may not arrive, may or may not be signed in a way the recipient understands, may or may not arrive in order, and may or may not be replayable when something goes wrong. The default state of a webhook system is "the customer does not trust it," and turning that around takes real engineering work.
This piece is about the design choices that make webhooks trustworthy. None of them are exotic. All of them are the kind of choice that, missing, generates support tickets for years.
The signature: pick one and document it
Every webhook should be signed. The signature should be computed over the raw bytes of the request body, not the parsed JSON, because parsing is implementation-defined and signatures over parsed data are signatures the client cannot easily verify. The standard pattern is HMAC-SHA256 over the body with a per-customer secret, with the signature placed in a header like X-Webhook-Signature.
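As a minimal sketch of that pattern, here is how signing and verifying over raw bytes might look in Python. The secret value and the function names are placeholders for illustration, not a prescribed API; the important part is that both sides hash the exact bytes on the wire and that the comparison is constant-time.

```python
import hashlib
import hmac


def sign_body(secret: bytes, raw_body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the exact bytes that will be sent."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_body(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Recompute the signature over the raw bytes and compare in constant time."""
    expected = sign_body(secret, raw_body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())


# Hypothetical per-customer secret and payload, for illustration only.
secret = b"whsec_example_secret"
raw_body = b'{"event_id": "evt_abc123", "type": "user.created"}'

# The sender would place this value in the X-Webhook-Signature header.
signature = sign_body(secret, raw_body)
```

On the receiving side, the handler should call verify_body with the request body exactly as it arrived, before any JSON parsing or re-serialization.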
The implementation detail that matters most is what you publish to your customers. A page that says "we sign with HMAC-SHA256, here is example code in Python, Node, Go, Ruby, and PHP" turns signature verification from a ten-hour debugging session into a ten-minute integration. Stripe's documentation is the reference here. If you make customers reverse-engineer your signing scheme from response headers, you have made the wrong choice.
A second detail: include a timestamp in the signed payload, send it alongside the signature so the recipient can recompute it, and reject requests where the timestamp is more than a few minutes old. This narrows the window for replay attacks to that tolerance, and it is a small addition to the signing process.
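A sketch of the timestamp check, under a few assumptions that are not part of any standard: the signed payload is the timestamp, a dot, then the raw body; the timestamp travels in its own header (say, a hypothetical X-Webhook-Timestamp); and "a few minutes" is taken to mean 300 seconds.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed value for "a few minutes"


def sign_with_timestamp(secret: bytes, raw_body: bytes, timestamp: int) -> str:
    """Sign "<timestamp>.<raw body>" so the timestamp cannot be altered."""
    signed_payload = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()


def verify_with_timestamp(secret, raw_body, timestamp, signature, now=None) -> bool:
    """Reject stale timestamps first, then check the signature."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old (or too far in the future) to accept
    expected = sign_with_timestamp(secret, raw_body, timestamp)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because the timestamp is inside the signed payload, an attacker who captures a request cannot refresh the timestamp without invalidating the signature.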
The retry policy: predictable and visible
Network failures happen. The customer's endpoint will be slow, will return a 500, will be down for maintenance. The retry policy is what determines whether these failures cause data loss or just temporary delay.
The pattern that works is escalating backoff with jitter, capped at a reasonable maximum. A common shape is: retry after 30 seconds, 1 minute, 5 minutes, 15 minutes, 1 hour, 4 hours, 12 hours, 24 hours, with each delay measured from the previous failed attempt. If the final retry also fails, mark the webhook as failed and stop retrying. The total budget with these delays is about 41 hours, a little under two days, with the bulk of the retries in the first few hours.
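The schedule above can be expressed as a lookup table plus jitter. This is only a sketch: the 10% jitter fraction and the function name are assumptions, and each delay is assumed to be measured from the previous failed attempt.

```python
import random

# Delays from the schedule above, in seconds:
# 30s, 1m, 5m, 15m, 1h, 4h, 12h, 24h.
RETRY_DELAYS_SECONDS = [30, 60, 300, 900, 3600, 14400, 43200, 86400]


def next_retry_delay(failed_attempts: int, jitter_fraction: float = 0.1, rng=random):
    """Return the delay before the next attempt, or None to give up.

    failed_attempts is the number of delivery attempts that have failed so far.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_DELAYS_SECONDS):
        return None  # schedule exhausted: mark the webhook as failed
    base = RETRY_DELAYS_SECONDS[failed_attempts - 1]
    # Spread retries out so many failing deliveries do not fire at once.
    return base * (1 + rng.uniform(-jitter_fraction, jitter_fraction))
```

Storing the computed next-attempt time with the delivery record is what lets a dashboard show the customer when the next retry is scheduled.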
What matters more than the exact numbers is making them visible to the customer. The customer should be able to log into a dashboard and see, for any webhook, exactly how many delivery attempts happened, what the response status was on each, and what time the next attempt is scheduled for. Without this visibility, the customer's only signal that something is wrong is "the data did not arrive," and they have no way to tell whether your system is going to retry or has given up.
Idempotency: the customer's problem you can mitigate
Even with perfect retry logic, webhooks can be delivered more than once. The network can drop a 200 response, your retry kicks in, and the customer receives the same event twice. This is unavoidable in any at-least-once delivery system, which is what every reasonable webhook system is.
The mitigation is to include a stable event ID in every webhook payload, and to document loudly that customers must use it for deduplication. {"event_id": "evt_abc123", "type": "user.created", "data": {...}} is the canonical shape. If the customer receives the same event_id twice, they should ignore the second one.
The event ID should be the same across retries of the same event — that is the whole point. If you regenerate the ID on retry, you have given the customer no way to deduplicate. This is a surprisingly common bug.
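On the receiving side, deduplication by event ID might look like the sketch below. The class name and the in-memory set are illustrative assumptions; a real handler would more likely use a durable store, such as a database table with a unique constraint on the event ID.

```python
import json


class EventDeduplicator:
    """Process each event_id at most once (in-memory sketch)."""

    def __init__(self):
        self._seen = set()  # assumption: stands in for a durable store

    def handle(self, raw_body: bytes, process) -> bool:
        """Run process(event) unless this event_id was already handled.

        Returns True if the event was processed, False if it was a duplicate.
        """
        event = json.loads(raw_body)
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery: acknowledge and ignore
        process(event)
        # Record the ID only after processing succeeds, so that a failure
        # leaves the event eligible for the sender's next retry.
        self._seen.add(event_id)
        return True
```

A duplicate should still be answered with a 2xx status; otherwise the sender will keep retrying an event the customer has already handled.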
Replay: the dashboard feature that earns trust
The single feature that converts customer skepticism into trust is replay. Give the customer a dashboard where they can see every webhook your system sent them, view the request body and response, and replay any of them on demand. When something goes wrong on their end — a deploy that broke their handler, a database that was down, a bug that dropped events — they can replay the missing events themselves without filing a support ticket.
The implementation is straightforward: store every webhook delivery attempt with its body, headers, and response for some retention window (we use 30 days). Provide an endpoint that re-sends a stored webhook to the original URL. Provide a UI button that calls that endpoint. The total amount of code is small. The amount of trust it earns is large.
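To make the shape concrete, here is a rough sketch of a delivery store with a replay method. Every name in it is hypothetical, the storage is an in-memory dictionary, and the send callable stands in for whatever HTTP client the sender uses.

```python
import time
from dataclasses import dataclass
from typing import Optional

RETENTION_SECONDS = 30 * 24 * 3600  # the 30-day retention window


@dataclass
class StoredDelivery:
    delivery_id: str
    url: str
    body: bytes
    headers: dict
    response_status: Optional[int]
    stored_at: float


class DeliveryStore:
    def __init__(self, send):
        # Hypothetical callable: send(url, body, headers) -> HTTP status code.
        self._send = send
        self._deliveries = {}

    def record(self, delivery: StoredDelivery) -> None:
        self._deliveries[delivery.delivery_id] = delivery

    def replay(self, delivery_id: str, now: Optional[float] = None) -> int:
        """Re-send a stored delivery to its original URL, body unchanged."""
        now = time.time() if now is None else now
        original = self._deliveries[delivery_id]
        if now - original.stored_at > RETENTION_SECONDS:
            raise LookupError("delivery is outside the retention window")
        return self._send(original.url, original.body, original.headers)
```

The body is re-sent byte for byte, so the event ID stays the same and the customer's deduplication still works. If the signature covers a timestamp, a replay would presumably need a fresh timestamp and signature rather than the stored headers; this sketch does not handle that.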
Ordering: the lie you should not tell
The most common false promise in webhook documentation is "events are delivered in order." Almost no webhook system actually delivers in order under all conditions, because retries break ordering. Event A is sent, fails, and is queued for retry. Event B is sent and succeeds. Event A is retried later, after Event B. The customer has received them out of order.
The honest answer is to tell customers events may arrive out of order, and to include a timestamp on every event so they can sort if they need to. Some customers will need ordering; for them, the answer is to fetch state from your API after receiving any webhook, which is consistent regardless of delivery order.
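One way a customer might use the event timestamp is to keep only the newest event per resource and drop anything older. The sketch below assumes each event can be tied to a resource ID and carries the time the event occurred (not the time it was delivered); both are assumptions about the payload, and the class name is made up.

```python
class LatestStateTracker:
    """Keep the newest event per resource, ignoring late arrivals."""

    def __init__(self):
        self._latest = {}  # resource_id -> (event_timestamp, event)

    def apply(self, resource_id: str, event_timestamp: float, event: dict) -> bool:
        """Apply the event if it is newer than what we have; return whether it applied."""
        current = self._latest.get(resource_id)
        if current is not None and event_timestamp <= current[0]:
            return False  # stale: a newer event already arrived
        self._latest[resource_id] = (event_timestamp, event)
        return True

    def get(self, resource_id: str):
        current = self._latest.get(resource_id)
        return None if current is None else current[1]
```

This only suits events that each carry the full current state. When events are incremental, fetching state from the API after each webhook remains the safer approach.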
Some platforms guarantee in-order delivery within a single resource by serializing dispatch per resource. This works but limits throughput per resource and adds complexity. It is the right choice for a small set of integrations and the wrong choice as a default.
The customer-facing dashboard
More than any protocol detail, what separates webhook systems customers trust from ones they curse is the customer-facing dashboard. The dashboard should show, for each customer endpoint:
- The recent delivery history with status codes and timestamps
- The success rate over the last hour, day, and week
- The retry queue depth (how many webhooks are pending retry)
- The full request body and response for each delivery
- A button to manually replay any delivery
- A signing secret rotation flow
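As an example of how little computation sits behind these numbers, the success rate over a window could be derived from stored delivery attempts roughly as follows. The tuple layout and the choice to count only 2xx responses as successes are assumptions.

```python
def success_rate(attempts, window_seconds, now):
    """Fraction of attempts in the window that got a 2xx response.

    attempts is an iterable of (timestamp, status_code) pairs.
    Returns None when there were no attempts in the window.
    """
    recent = [
        status
        for timestamp, status in attempts
        if now - window_seconds <= timestamp <= now
    ]
    if not recent:
        return None  # nothing to report; avoid dividing by zero
    successes = sum(1 for status in recent if 200 <= status < 300)
    return successes / len(recent)


HOUR, DAY, WEEK = 3600, 24 * 3600, 7 * 24 * 3600
```

Calling it with HOUR, DAY, and WEEK against the same attempt log gives the three figures listed above.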
This is what WebhookVault is built around — capture, inspect, replay, and the visibility customers need to debug their integrations themselves rather than file tickets. The dashboard is the product, more than the protocol is.
The deeper lesson
Webhooks are a contract between your system and code you do not control, running in environments you cannot debug, on schedules you do not set. Every design choice should make that contract more transparent: signed payloads so the recipient knows the message is genuine, stable event IDs so they can deduplicate, exponential retries so transient failures do not cause permanent data loss, and a dashboard that shows them exactly what your system did and gives them the tools to recover when something on their end broke. Get those right and the webhook stops being a fragile bridge and becomes the kind of integration surface customers trust enough to build on.