Designing API Webhook Deactivation: When and How to Stop Calling Endpoints That Persistently Fail
Customers configure webhook endpoints once and forget them. When the endpoint goes away (DNS dies, cert expires, app gets decommissioned, team moves on), the API keeps retrying. After enough failures, the right answer is to stop. The harder question is what counts as enough.
Every webhook provider eventually has to decide what to do with persistently failing endpoints. A customer configured the endpoint, it worked for six months, then their team rewrote the service and never updated the URL. Now every event you emit triggers a retry storm that ends only when the retry budget is exhausted, plus a notification to a dashboard nobody is watching, plus latency on your delivery pipeline because the failed attempts consume worker capacity. The structurally right answer is to stop. The harder questions are what counts as persistently failing, how the customer learns that you stopped, and how easily they can resume delivery when they fix the endpoint.
Why automatic deactivation is necessary
The naive approach is to never deactivate. The customer configured the endpoint; the customer can decide when to remove it. The arithmetic does not work. A B2B SaaS at moderate scale handles tens of millions of webhook deliveries per month across thousands of customers. The 1-2% of endpoints that are persistently dead represent hundreds of thousands of guaranteed-fail delivery attempts per month, each consuming worker capacity, network egress, retry queue space, and observability budget. The cost is not catastrophic but it is not zero, and it grows linearly with customer count.
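To make the arithmetic concrete, here is a back-of-the-envelope sketch. The inputs are assumed round numbers for illustration (20 million events per month, 1.5% of them addressed to dead endpoints, 12 attempts per event), not measured figures, and the function name is hypothetical.

```python
def wasted_attempts_per_month(
    monthly_events: int, dead_share: float, attempts_per_event: int = 1
) -> int:
    """Estimate delivery attempts per month that are guaranteed to fail."""
    return round(monthly_events * dead_share * attempts_per_event)


# Assumed inputs: 20M events/month, 1.5% of them bound for dead endpoints.
first_attempts = wasted_attempts_per_month(20_000_000, 0.015)

# If each of those events also burns a full 12-attempt retry budget:
with_retries = wasted_attempts_per_month(20_000_000, 0.015, attempts_per_event=12)

print(first_attempts, with_retries)
```

Under these assumptions the first attempts alone land in the hundreds of thousands, and retries multiply that into the millions.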
The customer-facing consequence is worse. The failure events surface as dashboard alerts, support tickets, and confused emails ("why am I getting 50,000 retry notifications for an endpoint I removed last quarter"). Each one consumes support time, and the resolution is always the same: deactivate the endpoint. Doing this automatically and proactively converts that support load to silent infrastructure work.
What counts as persistent failure
The minimum useful definition is a count of consecutive failures, where each failure is a delivery that exhausted its full retry budget. If the retry policy allows 12 attempts over 24 hours, then 12 consecutive deliveries that each exhaust all 12 attempts, over 12 days, is a reasonable threshold for deactivation. The exact numbers depend on the retry policy and the customer base; the principle is that the failures need to persist long enough that the customer would have noticed and acted if they cared.
The more sophisticated definition adds time and failure-rate dimensions. An endpoint that has been failing 100% of attempts for 7 consecutive days is unambiguously dead. An endpoint that fails about half of its attempts for 30 days is also problematic, but for different reasons (possibly a flapping endpoint that needs investigation rather than deactivation). The mature pattern uses different deactivation thresholds for different failure patterns: fast deactivation for a 100% failure rate, slow deactivation for partial failure.
The right default for B2B SaaS is 7 consecutive days of 100% failure rate with no successful delivery in that window. This is conservative enough that legitimate intermittent failures (cloud provider incidents, customer-side deployment issues, certificate renewals) do not trigger deactivation, and aggressive enough that genuinely dead endpoints get cleaned up within a reasonable window.
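The two-speed rule can be sketched as a small classifier. This is a minimal illustration, not a description of any product's implementation: the `Attempt` type, the `classify` function, the 50% flapping cutoff, and the 12-attempt minimum sample are all assumptions introduced here. Only the 7-day and 30-day windows come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

FAST_WINDOW = timedelta(days=7)    # 100% failure for 7 days (from the text)
SLOW_WINDOW = timedelta(days=30)   # look-back for partial failure (from the text)
FLAPPING_RATE = 0.5                # assumed cutoff for a "flapping" endpoint
MIN_ATTEMPTS = 12                  # assumed: skip endpoints with too little traffic


@dataclass
class Attempt:
    at: datetime
    ok: bool


def classify(attempts: list[Attempt], now: datetime) -> str:
    """Return "deactivate", "investigate", or "healthy"."""
    # Collect the unbroken run of failures since the most recent success.
    streak: list[Attempt] = []
    for attempt in sorted(attempts, key=lambda a: a.at):
        if attempt.ok:
            streak = []
        else:
            streak.append(attempt)

    # Fast path: nothing has succeeded for at least the full 7-day window.
    if len(streak) >= MIN_ATTEMPTS and now - streak[0].at >= FAST_WINDOW:
        return "deactivate"

    # Slow path: heavy partial failure over 30 days is flagged for a human.
    recent = [a for a in attempts if now - a.at <= SLOW_WINDOW]
    if len(recent) >= MIN_ATTEMPTS:
        failure_rate = sum(not a.ok for a in recent) / len(recent)
        if failure_rate >= FLAPPING_RATE:
            return "investigate"
    return "healthy"
```

A single success anywhere in the window resets the streak, which is what keeps a one-day cloud incident from tripping the fast path.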
The notification surface
Automatic deactivation without notification is hostile. The customer needs to learn that you stopped delivering. The minimum notification surface is three channels: an email to the account contact (or all admins), a dashboard banner that persists until acknowledged, and a webhook event (yes, a webhook about the webhook being deactivated, ideally delivered to other endpoints on the same account so it actually arrives).
The email content should include the endpoint URL, the time range of failures, the most recent few failure responses (status code plus first 500 bytes of body, redacted for sensitive data), and the resumption path. The dashboard should show the same information in more detail plus a one-click "test endpoint" button that the customer can use to verify the endpoint is back before reactivating.
The customer-facing language matters. Phrases like "Your endpoint has been deactivated" trigger defensive responses; phrases like "We have paused delivery to this endpoint due to repeated failures" are accurate and less adversarial. The goal is to communicate that the deactivation is a service to the customer (you are no longer being spammed with failed-delivery notifications) rather than a punishment.
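The notification content described in this section could be assembled along these lines. Everything named here (`FailureSample`, `excerpt`, `build_pause_notice`, the field names, and the redaction pattern) is hypothetical and shown only to illustrate the shape of the payload. Real redaction rules would need to match the data that actually appears in response bodies.

```python
import re
from dataclasses import dataclass
from datetime import datetime

MAX_BODY_BYTES = 500  # "first 500 bytes of body" from the text
# Illustrative pattern only; it masks values after a few common secret-like keys.
SECRET = re.compile(
    r'(?i)((?:authorization|api[_-]?key|token|password|secret)"?\s*[:=]\s*)'
    r'("[^"]*"|[^\r\n,&]+)'
)


@dataclass
class FailureSample:
    at: datetime
    status_code: int
    body: bytes


def excerpt(body: bytes) -> str:
    """Keep the first 500 bytes, then mask values that look like secrets."""
    text = body[:MAX_BODY_BYTES].decode("utf-8", errors="replace")
    return SECRET.sub(r"\1[REDACTED]", text)


def build_pause_notice(
    endpoint_url: str,
    failures: list[FailureSample],
    resume_url: str,
    max_samples: int = 3,
) -> dict:
    """Build the data shared by the email, the banner, and the webhook event."""
    if not failures:
        raise ValueError("a pause notice needs at least one failure sample")
    ordered = sorted(failures, key=lambda f: f.at)
    return {
        "headline": "We have paused delivery to this endpoint due to repeated failures",
        "endpoint_url": endpoint_url,
        "failing_since": ordered[0].at.isoformat(),
        "last_failure_at": ordered[-1].at.isoformat(),
        "recent_failures": [
            {"at": f.at.isoformat(), "status": f.status_code, "body": excerpt(f.body)}
            for f in ordered[-max_samples:]
        ],
        "resume_url": resume_url,
    }
```

Building one payload and rendering it three ways keeps the email, the banner, and the webhook event from drifting apart.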
The resumption mechanism
Reactivation should be one click for the customer once they have verified the endpoint is back. The dashboard pattern is a "test endpoint" button that sends a synthetic ping event, followed by a "resume delivery" button that becomes available once the test succeeds. The two-step flow forces the customer to verify before they ask us to resume sending real events.
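A minimal sketch of the test-then-resume gate, assuming a hypothetical `Endpoint` record and a `send_ping` callable that delivers the synthetic event and returns the HTTP status code. None of these names come from a real product API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class DeliveryState(Enum):
    ACTIVE = "active"
    PAUSED = "paused"


@dataclass
class Endpoint:
    url: str
    subscriptions: list[str]
    state: DeliveryState = DeliveryState.ACTIVE
    test_passed: bool = False  # set by the most recent synthetic ping


def verify_endpoint(endpoint: Endpoint, send_ping: Callable[[str], int]) -> bool:
    """Step 1: send a synthetic ping and record whether it returned 2xx."""
    status = send_ping(endpoint.url)
    endpoint.test_passed = 200 <= status < 300
    return endpoint.test_passed


def resume_delivery(endpoint: Endpoint) -> None:
    """Step 2: flip the state back to active, but only after a passing test."""
    if endpoint.state is not DeliveryState.PAUSED:
        raise ValueError("endpoint is not paused")
    if not endpoint.test_passed:
        raise PermissionError("run a successful test before resuming delivery")
    endpoint.state = DeliveryState.ACTIVE
    endpoint.test_passed = False  # a future pause requires a fresh test
```

Resuming touches only `state`; the URL and subscriptions are never rewritten, so there is nothing for the customer to reconfigure.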
The backfill question is harder. What about the events that we did not deliver during the deactivation window? There are three options. First, do not backfill; the customer accepts that events during the dead period are lost. Second, backfill from a configurable point (last 24 hours, last week). Third, give the customer a list of missed event IDs and let them request replays selectively. The right default for B2B SaaS is no automatic backfill with explicit replay available; backfilling 7 days of events without the customer asking for it can hit systems that have moved on with a surprising flood of stale events.
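The recommended default (no automatic backfill, explicit replay on request) might look like the following sketch. `PausedWindow` and `events_to_replay` are hypothetical names, and the event IDs are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class PausedWindow:
    """Tracks which events were skipped while delivery was paused."""
    paused_at: datetime
    missed_event_ids: list[str] = field(default_factory=list)

    def record_missed(self, event_id: str) -> None:
        self.missed_event_ids.append(event_id)


def events_to_replay(
    window: PausedWindow, requested_ids: Optional[Iterable[str]] = None
) -> list[str]:
    """Return the events to redeliver after the customer resumes delivery."""
    if requested_ids is None:
        return []  # default: nothing is backfilled unless the customer asks
    requested = set(requested_ids)
    # Only replay events that were actually missed, in their original order.
    return [eid for eid in window.missed_event_ids if eid in requested]
```

Showing the customer `missed_event_ids` and letting them choose keeps the replay volume under their control rather than ours.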
Three patterns that fail
First, deactivation thresholds set too aggressively. We have seen providers that deactivate after 24 hours of failure, which catches every transient cloud-provider incident plus every customer maintenance window. The right threshold is conservative enough that "had a bad day" does not trigger deactivation; the customer's confidence in the webhook system is more important than the small cost of a few extra days of guaranteed-fail attempts.
Second, silent deactivation with no notification. Several large providers have done this historically; the result is universal customer rage and a steady flow of support tickets asking "why did webhooks stop working." The notification cost is small (one email plus one dashboard banner); the customer-trust cost of silent deactivation is large.
Third, deactivation that requires complex reactivation. We have seen providers where reactivation requires deleting and recreating the endpoint, which loses the endpoint configuration and any associated subscriptions. The right pattern is reactivation as a state change, not a recreation: the endpoint configuration persists, the subscriptions persist, only the delivery flag flips back to active.
Our use across the four products
WebhookVault, as the most webhook-centric product, has the most sophisticated deactivation policy: 7 consecutive days of 100% failure rate triggers deactivation, with email + dashboard banner + reactivation flow. CronPing has webhook alerts that benefit from the same treatment; an alert endpoint that has been dead for a week probably is not coming back, and we deactivate on the same 7-day schedule. FlagBit's webhook subscriptions for flag-change events use a longer 14-day window because flag changes are less frequent and the cost of false-positive deactivation is higher.
DocuMint has no outbound webhooks (it consumes Stripe webhooks but does not emit any), so the deactivation policy does not apply. The Stripe inbound side is the mirror-image problem: we are the receiver, and Stripe handles deactivation policy on their side based on our response codes. The implication is that response-code discipline matters. Stripe counts any non-2xx response as a failed delivery, so letting application bugs surface as error responses for events we did receive can lead Stripe to disable our endpoint, which is a much worse failure mode than a few thousand failed delivery attempts.
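One way to keep response codes disciplined on the receiving side is to acknowledge as soon as the event is verified and durably queued, so that bugs in later processing never reach the sender as error responses. The sketch below is a generic illustration under that assumption; `acknowledge` and the `enqueue` callable are hypothetical, and signature verification is assumed to happen before this function is called.

```python
from typing import Callable


def acknowledge(
    payload: bytes, signature_ok: bool, enqueue: Callable[[bytes], None]
) -> int:
    """Pick the HTTP status to return to the webhook sender."""
    if not signature_ok:
        # The request is not one we can trust; this is the caller's problem.
        return 400
    try:
        # Persist first and process later, outside the request cycle.
        enqueue(payload)
    except Exception:
        # We genuinely could not accept the event, so ask for a retry.
        return 503
    # The event is safely stored; processing failures are now ours to handle.
    return 200
```

With this split, an error response means only "we could not take delivery", which is the signal a sender's retry and deactivation logic is built around.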
The deeper observation
Webhook deactivation policy is one of those API design decisions that compounds. Get it right and customers with healthy endpoints never notice; dead endpoints get paused with clear notice and the delivery pipeline stays clean. Get it wrong and you accumulate either a permanent overhead of failed deliveries (no deactivation) or a steady stream of support tickets from customers surprised that their webhooks stopped (aggressive or silent deactivation). The right answer is the conservative one with clear communication: deactivate after the failures persist long enough that the customer must have known, notify through three channels so the deactivation is impossible to miss, and make reactivation a one-click state change once the endpoint tests healthy so the customer can resume delivery without friction. The pattern is durable across the four products because the underlying customer behavior (configure once, forget, eventually rewrite or remove the receiver) is durable.
Our products, DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay), put these patterns into production.