Designing API Webhook Delivery Failure Escalation: How to Tell Customers Their Endpoint Is Broken
Webhook deliveries fail constantly. Most failures are transient. A small fraction become persistent, and the customer's integration is then silently broken. The escalation question is when to interrupt the customer with a notification, through which channel, and what the contract is for resumption.
Every webhook system has a delivery failure rate, and the failure rate is not zero. Network blips drop a small fraction of attempts. Receiver-side deployments restart endpoints momentarily and produce 502 or 504 responses. Customer-side application errors produce 500s. The first response to all of these is the same: retry with backoff, deliver eventually, count the success rate over a window. The pattern is standard and works for almost all transient failure modes. What it does not handle is the case where the failures stop being transient and become persistent: the customer's endpoint has been broken for hours or days, every retry is failing the same way, and the events are accumulating in the delivery queue waiting for an endpoint that is not coming back.
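The retry-with-backoff step can be sketched roughly as follows. This is a minimal illustration, not any particular product's schedule: the function name `backoff_delays`, the 30-second base delay, the one-hour cap, and the jitter parameter are all assumptions chosen for the example.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_attempts: int,
    base_seconds: float = 30.0,    # assumed starting delay
    cap_seconds: float = 3600.0,   # assumed ceiling on any single delay
    jitter: float = 0.0,           # fraction of the delay added at random
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait before each retry, doubling each time up to a cap."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        # Jitter spreads retries out so a recovering receiver is not hit in lockstep.
        delay += delay * jitter * rng.random()
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays(4))
```

The cap matters for the escalation discussion: without it, the gap between attempts grows until a persistently broken endpoint is barely being probed at all, and the failure-rate windows have too few samples to evaluate.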
The escalation question is when to stop treating failures as transient and start treating them as a problem the customer needs to know about. The cost of getting the threshold wrong cuts both ways. Setting it too aggressively produces false-positive notifications about transient issues, training customers to ignore alerts, which makes the system worse than no alerts. Setting it too conservatively means the customer integration is silently broken for days while events queue up or get dropped, and the eventual discovery is more painful than an early warning would have been. The right threshold is a deployment-time decision that depends on the product context, but the design pattern is consistent across products and worth getting right.
The three escalation stages
The pattern that emerges from observing how Stripe, GitHub, and Linear handle webhook escalation is a three-stage model. Stage one is silent retry with exponential backoff: the default behavior for all failures, with no customer-visible signal beyond the failed attempts recorded in the delivery history, should the customer go looking. Stage two is a dashboard-visible warning, where a persistent failure pattern produces a visible indicator in the customer's webhook configuration UI without sending an interrupt-driven notification. Stage three is interrupt-driven escalation via email or in-app notification, where the failure pattern has crossed the threshold at which the customer needs to know now rather than the next time they look at the dashboard.
The stages map to confidence levels about whether the failure is the customer's problem to fix. Stage one assumes it might still resolve itself. Stage two assumes it probably will not resolve itself but is not urgent enough to interrupt. Stage three assumes the customer needs to act and the cost of waiting until they happen to notice exceeds the cost of an interruption. The thresholds between stages are tunable and worth tuning to match the product's typical recovery patterns, but the structure of three stages with different communication channels is the part that holds up.
What the thresholds should look like
The threshold from stage one to stage two depends on the typical recovery time for transient failures and the typical sustained-failure pattern customers exhibit. The observation across products is that transient failures usually recover within 5-15 minutes of the first failure, sometimes longer for receiver-side deployment patterns. A 1-hour window with a failure rate of 80% or higher is a reasonable stage-two threshold: it tolerates short outages and load spikes without producing a warning, but it does not let a fully broken endpoint sit in stage one for the whole day. Because the dashboard indicator is non-interrupting, the cost of a false positive is small.
The threshold from stage two to stage three is higher and depends on what the customer is signing up for when they configure a webhook endpoint. A 24-hour window with 90%+ failure rate is a reasonable stage-three threshold for most B2B SaaS products: it filters out the case where a customer is doing an overnight maintenance window and would not appreciate being woken up about it, but it catches the case where an endpoint has been actually broken since the previous business day. The interrupt-driven notification is more expensive in customer attention than the dashboard indicator, so the threshold is tuned to higher confidence that the customer needs to act.
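One way to express the two thresholds is a small classifier over an endpoint's recent delivery attempts. The sketch below is illustrative only: the names `Stage`, `Thresholds`, and `classify` are hypothetical, the `min_attempts` guard is an added assumption (so that a single failed attempt cannot count as a 100% failure rate), and requiring the stage-two condition before stage three is a design choice of this example rather than something the thresholds above mandate.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List, Tuple

Attempt = Tuple[float, bool]  # (unix timestamp in seconds, delivery succeeded)


class Stage(IntEnum):
    SILENT_RETRY = 1
    DASHBOARD_WARNING = 2
    INTERRUPT = 3


@dataclass(frozen=True)
class Thresholds:
    stage2_window_s: float = 3600.0     # 1-hour window
    stage2_failure_rate: float = 0.80
    stage3_window_s: float = 86400.0    # 24-hour window
    stage3_failure_rate: float = 0.90
    min_attempts: int = 5               # assumption: ignore windows with too few samples


def _breached(attempts: List[Attempt], now: float, window_s: float,
              min_rate: float, min_attempts: int) -> bool:
    """True if the failure rate inside the window is at or above min_rate."""
    recent = [ok for ts, ok in attempts if now - window_s <= ts <= now]
    if len(recent) < min_attempts:
        return False
    failed = sum(1 for ok in recent if not ok)
    return failed / len(recent) >= min_rate


def classify(attempts: Iterable[Attempt], now: float,
             t: Thresholds = Thresholds()) -> Stage:
    """Map an endpoint's attempt history to an escalation stage."""
    history = list(attempts)
    stage2 = _breached(history, now, t.stage2_window_s,
                       t.stage2_failure_rate, t.min_attempts)
    stage3 = _breached(history, now, t.stage3_window_s,
                       t.stage3_failure_rate, t.min_attempts)
    # Stage three requires the last hour to still look broken, so an
    # endpoint that is already recovering does not trigger an interrupt.
    if stage2 and stage3:
        return Stage.INTERRUPT
    if stage2:
        return Stage.DASHBOARD_WARNING
    return Stage.SILENT_RETRY
```

An endpoint that has failed every attempt for a full day classifies as `Stage.INTERRUPT`; one that started failing an hour ago after a healthy day classifies as `Stage.DASHBOARD_WARNING`, because its 24-hour failure rate is still low.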
The notification channels
The dashboard indicator is the cheapest channel and should always be the first signal. The implementation is a banner or alert badge on the webhook configuration screen with the per-endpoint status visible at a glance. The customer who happens to be in the dashboard for any reason sees the indicator and can investigate without needing a separate prompt. The cost is small because the customer was already in the product.
The email notification is the standard escalation channel for stage three. The email should go to the account's notification email or to a configurable webhook-failure-specific recipient, not to the billing email. The content should include the endpoint URL, the failure pattern summary, the time of first observed failure, the current delivery queue depth if applicable, and a direct link to the dashboard. The single most important property of the email is that the dashboard link works and the link goes to a screen that shows the customer what to do, not to a generic dashboard home page.
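The email contents listed above can be captured in a small structure so that no field is forgotten. This is a hypothetical sketch: the `EscalationEmail` class, its field names, and the wording of the message are assumptions made for the example, and the URLs in the usage are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EscalationEmail:
    endpoint_url: str
    failure_summary: str
    first_failure_at: datetime
    dashboard_link: str                 # deep link to the endpoint's own screen
    queue_depth: Optional[int] = None   # omitted when the product has no queue

    def render(self) -> str:
        """Build a plain-text body containing every required field."""
        lines = [
            f"Webhook deliveries to {self.endpoint_url} are failing.",
            "",
            f"Failure pattern: {self.failure_summary}",
            f"First observed failure: {self.first_failure_at.isoformat()}",
        ]
        if self.queue_depth is not None:
            lines.append(f"Events waiting in the delivery queue: {self.queue_depth}")
        lines += ["", f"Review and fix this endpoint: {self.dashboard_link}"]
        return "\n".join(lines)


if __name__ == "__main__":
    email = EscalationEmail(
        endpoint_url="https://example.com/hooks",
        failure_summary="97% of attempts returned HTTP 500 over the last 24 hours",
        first_failure_at=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
        dashboard_link="https://dashboard.example.com/webhooks/ep_123",
        queue_depth=412,
    )
    print(email.render())
```

Making `dashboard_link` a required field with no default is deliberate: the email cannot be constructed without a specific link, which guards against falling back to a generic dashboard home page.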
The in-app notification is the right pattern for products with a notification surface in the UI. It overlaps with email for the actual content but provides a second channel for customers who happen to be using the product at the time of the failure. The right composition is dashboard plus email plus in-app for stage-three escalations, with the customer able to suppress any individual channel via notification preferences.
The contract for resumption
Once the escalation has fired, the question is what counts as resolution. The naive answer is one successful delivery, but this is wrong: a single successful delivery after hours of failures could be a brief recovery in an otherwise broken pattern, and resetting the escalation state too quickly produces alert thrashing. The right pattern is a sustained-recovery threshold: some number of consecutive successful deliveries or some duration of failure rate below a threshold before the escalation state resets. A 10-minute window with under 10% failure rate is a reasonable resumption threshold for most products.
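A sustained-recovery check along these lines might look like the sketch below. The names `RecoveryStatus` and `recovery_status` are hypothetical, and the `min_attempts` guard is an assumption added for the example: without it, a single successful delivery inside an otherwise empty 10-minute window would read as a 0% failure rate and reset the escalation, which is exactly the naive behavior described above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Attempt = Tuple[float, bool]  # (unix timestamp in seconds, delivery succeeded)


@dataclass(frozen=True)
class RecoveryStatus:
    attempts_in_window: int
    failure_rate: float
    recovered: bool


def recovery_status(
    attempts: Iterable[Attempt],
    now: float,
    window_s: float = 600.0,          # 10-minute window
    max_failure_rate: float = 0.10,   # must be strictly under 10%
    min_attempts: int = 5,            # assumption: need enough samples to trust the rate
) -> RecoveryStatus:
    """Decide whether an escalated endpoint has recovered for long enough."""
    recent = [ok for ts, ok in attempts if now - window_s <= ts <= now]
    if not recent:
        return RecoveryStatus(0, 0.0, False)
    rate = sum(1 for ok in recent if not ok) / len(recent)
    recovered = len(recent) >= min_attempts and rate < max_failure_rate
    return RecoveryStatus(len(recent), rate, recovered)
```

Returning the attempt count and failure rate alongside the boolean gives a dashboard something to display as progress. For low-volume endpoints that rarely see five deliveries in ten minutes, the window or `min_attempts` would need tuning.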
The dashboard should show the escalation state and the resumption progress, so the customer who fixes the endpoint can see that the fix is being recognized rather than wondering whether the system is still treating their endpoint as broken. The transparency reduces support burden because customers can self-verify the fix without needing to ask. The resumption notification is optional but produces good will when it is included; a brief follow-up that the endpoint has resumed normal operation closes the loop on the original escalation email.
The catastrophic-failure escape hatch
The auto-deactivation feature discussed in cycle 235 is the escape hatch for the case where escalation has fired, the customer has not responded, and the endpoint has been broken for an extended period. The threshold there is days, not hours, and the action is to pause delivery rather than delete the subscription. The pause-and-notify pattern preserves the customer's configuration so that resumption is one click, and it caps the operational cost of an indefinitely broken endpoint at a finite number of retries.
The interaction between escalation and auto-deactivation is sequential. Stage one to stage two warns silently. Stage two to stage three notifies actively. Stage three to deactivation pauses delivery if the active notifications were ignored. The progression maps to escalating confidence that the customer either cannot or will not fix the endpoint, with the action becoming more drastic at each stage. The deactivation is the cap rather than the goal.
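The sequential progression can be modeled as a small state machine that advances at most one step per evaluation. The sketch below is an assumption-laden illustration: the `State` enum and `next_state` function are hypothetical, the 7-day default is borrowed from the WebhookVault configuration described later in this article, and measuring that period from the stage-three notification (rather than from the first failure) is a choice made for the example.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    SILENT_RETRY = "silent_retry"
    DASHBOARD_WARNING = "dashboard_warning"
    INTERRUPT = "interrupt"
    PAUSED = "paused"


def next_state(
    current: State,
    stage2_breached: bool,
    stage3_breached: bool,
    seconds_since_interrupt: Optional[float] = None,
    deactivate_after_s: float = 7 * 86400.0,  # assumed: 7 days after notifying
) -> State:
    """Advance one step at most; recovery and manual resume are handled elsewhere."""
    if current is State.SILENT_RETRY and stage2_breached:
        return State.DASHBOARD_WARNING
    if current is State.DASHBOARD_WARNING and stage3_breached:
        return State.INTERRUPT
    if (
        current is State.INTERRUPT
        and seconds_since_interrupt is not None
        and seconds_since_interrupt >= deactivate_after_s
    ):
        # Pause delivery but keep the subscription so resuming is one click.
        return State.PAUSED
    return current
```

Because each transition only fires from the immediately preceding state, an endpoint cannot be paused without first having passed through the active notification stage.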
Three patterns that fail
The first failure pattern is single-threshold escalation, where the system goes directly from silent retry to interrupt-driven notification with no intermediate stage. The result is either too many false-positive interruptions if the threshold is aggressive or too many silently broken endpoints if the threshold is conservative. The three-stage model takes more code, but it removes that forced trade-off, and the customer experience is much better for it.
The second failure pattern is dashboard-only escalation without email or in-app notification for stage three. The customer who has not visited the webhook dashboard for a week does not see the indicator, and the broken endpoint stays broken. The dashboard indicator is necessary but not sufficient for catastrophic failures; the interrupt-driven channel is what makes the escalation actually escalate.
The third failure pattern is escalation per delivery attempt rather than per endpoint. A customer whose broken endpoint receives hundreds of webhook events per day gets hundreds of failure notifications, which trains them to filter the notifications and miss the escalation when it actually fires. Per-endpoint aggregation with rate-limited notifications is the right granularity.
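Per-endpoint aggregation with rate limiting can be as simple as remembering when each endpoint was last notified. This is a minimal in-memory sketch under stated assumptions: the `EndpointNotifier` class is hypothetical, the one-notification-per-24-hours interval is an example value, and a real system would persist this state rather than hold it in a dictionary.

```python
from typing import Dict


class EndpointNotifier:
    """Allow at most one escalation notification per endpoint per interval."""

    def __init__(self, min_interval_s: float = 86400.0) -> None:
        self.min_interval_s = min_interval_s
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, endpoint_id: str, now: float) -> bool:
        """Return True (and record the send) if this endpoint is due a notification."""
        last = self._last_sent.get(endpoint_id)
        if last is not None and now - last < self.min_interval_s:
            return False  # already notified recently; stay quiet
        self._last_sent[endpoint_id] = now
        return True


if __name__ == "__main__":
    notifier = EndpointNotifier()
    # Hundreds of failed deliveries to the same endpoint yield one notification.
    sent = sum(notifier.should_notify("ep_123", now=float(t)) for t in range(500))
    print(sent)
```

The key is the endpoint identifier, not the event or the delivery attempt, so the number of notifications is independent of how much traffic the broken endpoint receives.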
Our use across the four products
WebhookVault is the product where escalation matters most because the entire product is webhook-shaped. The implementation uses the three-stage model with a 1-hour stage-two threshold, a 24-hour stage-three threshold, and a 7-day deactivation threshold, with email plus in-app notification for stage three and the dashboard indicator for all stages. CronPing emits webhook notifications when monitors miss their schedule, and the same escalation model applies to the webhook endpoint configured to receive those notifications, with the wrinkle that a failing notification endpoint produces the meta-problem of failing to notify about failure. FlagBit's webhook surface is smaller, and its escalation thresholds are tighter: 30 minutes for stage two and 12 hours for stage three. DocuMint is on the receiver side for Stripe webhooks, so the escalation model applies in reverse, with the Stripe-side retry policy as the input.
The shared escalation infrastructure across our four products is a per-product configuration of the three-stage thresholds, plus a shared notification dispatch module that handles the email and in-app pathways uniformly. The deeper observation is that escalation is the part of webhook delivery that customers do not think about until it matters, and the asymmetry between the cost of building escalation well and the cost of not building it at all favors building it well from the start. The customer experience of a webhook system without escalation is fine until the first time their integration breaks silently for a week, and then the system loses credibility in a way that is hard to recover from.
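A per-product configuration of this kind might be sketched as follows. Only the window lengths for WebhookVault and FlagBit come from the description above; the `EscalationConfig` class, its field names, the 80% and 90% failure-rate defaults applied to each product, and the validation rules are assumptions made for illustration, and CronPing and DocuMint are left out because their specific thresholds are not stated.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Optional


@dataclass(frozen=True)
class EscalationConfig:
    stage2_window: timedelta
    stage3_window: timedelta
    deactivate_after: Optional[timedelta] = None  # None: not specified here
    stage2_failure_rate: float = 0.80             # assumed default
    stage3_failure_rate: float = 0.90             # assumed default

    def __post_init__(self) -> None:
        # Each stage must take longer to reach than the one before it.
        if not self.stage2_window < self.stage3_window:
            raise ValueError("stage-two window must be shorter than stage-three window")
        if self.deactivate_after is not None and self.deactivate_after <= self.stage3_window:
            raise ValueError("deactivation must come after stage three")


PRODUCT_CONFIGS: Dict[str, EscalationConfig] = {
    "webhookvault": EscalationConfig(
        stage2_window=timedelta(hours=1),
        stage3_window=timedelta(hours=24),
        deactivate_after=timedelta(days=7),
    ),
    "flagbit": EscalationConfig(
        stage2_window=timedelta(minutes=30),
        stage3_window=timedelta(hours=12),
    ),
}
```

Validating the ordering at construction time catches a misconfigured product at startup instead of at the moment an endpoint breaks.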
Our products, DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay), put these patterns into production.