API Design

Designing API Webhook Subscription Health Scores: Surfacing Integration Quality Without Customer Action

Webhook subscriptions degrade silently. The customer notices when something breaks in production weeks after the integration started failing. A computed health score on each subscription surfaces degradation early without requiring customer instrumentation.

Anethoth

02 Jun 2026 — 6 min read

The webhook subscription that quietly stops working is one of the more frustrating failure modes in B2B SaaS integrations. The customer set up the integration six months ago, it worked, and they moved on. The receiving system was redeployed two months ago and the deployment changed a path. Webhook deliveries have been returning 404 for those two months. Nobody noticed because nobody was checking, and the downstream business consequences are exactly what the integration was supposed to prevent.

The standard fix is automatic deactivation after sustained failure (covered in cycle 235). That handles the catastrophic case but misses degradations that fall short of it: the subscription that succeeds 60 percent of the time, the one with 30-second response times, the one that switched to acknowledging after processing and is silently dropping events under load. A computed health score per subscription, surfaced in the dashboard and the API, catches these earlier and gives customers something to act on before the catastrophic threshold trips.

What a health score should actually represent

The temptation when designing a health score is to compute a single number from a weighted combination of metrics and call it good. The result is opaque: a 73-percent health score does not tell the customer what is wrong or what to fix. The pattern that works better is a small number of distinct dimensions, each scored separately, with a composite score that defaults to the worst dimension.

The four dimensions we settled on across our products are: delivery success, response latency, schema compliance, and configuration freshness. Each is computed over the past 30 days of delivery attempts, normalized to a 0-100 scale, and displayed with its underlying metric so customers can see what drove the score.

Delivery success is the percentage of delivery attempts that resulted in a 2xx response from the receiver. A score of 100 means every attempt succeeded; a score of 70 means 30 percent failed. The bands are "healthy" at 95 and above, "degraded" from 80 up to 95, and "unhealthy" below 80. The metric is straightforward to compute and easy for customers to understand.
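As a minimal sketch, the delivery success score and its bands could be computed as follows. The function names are illustrative, and the boundary handling (exactly 95 counts as healthy, exactly 80 as degraded) is an assumption.

```python
from typing import Iterable


def delivery_success_score(status_codes: Iterable[int]) -> float:
    """Percentage of delivery attempts that returned a 2xx status (0-100)."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no delivery attempts in the window")
    successes = sum(1 for code in codes if 200 <= code < 300)
    return 100.0 * successes / len(codes)


def delivery_success_band(score: float) -> str:
    """Map a delivery success score to the band names used in the text."""
    # Assumption: boundary values belong to the healthier band.
    if score >= 95:
        return "healthy"
    if score >= 80:
        return "degraded"
    return "unhealthy"
```

Seven 200 responses and three 404 responses give a score of 70, which lands in the unhealthy band.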

Response latency is the p95 response time from the receiver, normalized against a 5-second target. A score of 100 means p95 is under 1 second; a score of 50 means p95 is around 5 seconds; lower scores reflect proportionally slower receivers. The bands are "healthy" at 90 and above (p95 under roughly 2 seconds), "degraded" from 60 up to 90 (p95 between roughly 2 and 4 seconds), and "unhealthy" below 60. Slow receivers matter because they approach the platform's timeout cap and produce intermittent failures even when they would succeed given more time.
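One way to turn p95 latency into a 0-100 score is sketched below. The text fixes only two anchor points (100 at or under 1 second, 50 at about 5 seconds); the straight-line interpolation between and beyond them, the floor at 0, and the nearest-rank percentile method are all assumptions made for illustration.

```python
from typing import Sequence


def p95(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of response times, in seconds."""
    if not latencies_s:
        raise ValueError("no response times in the window")
    ordered = sorted(latencies_s)
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_score(p95_s: float) -> float:
    """Assumed mapping: 100 at <= 1 s, 50 at 5 s, linear, floored at 0."""
    if p95_s <= 1.0:
        return 100.0
    # 50 points lost over 4 seconds is 12.5 points per second.
    return max(0.0, 100.0 - 12.5 * (p95_s - 1.0))
```

Under this assumed mapping, a receiver whose p95 is 6 seconds scores 37.5, and anything at 9 seconds or slower scores 0.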

Schema compliance measures the fraction of recent deliveries that the receiver acknowledged with a response shape we recognize as valid. The platform does not parse customer response bodies in detail, but we do check that 2xx responses have the right content-type and any platform-specific acknowledgment headers. The bands are "healthy" at 95 and above, "unhealthy" below 80, and "degraded" in between. The metric catches receivers that have started silently returning HTML error pages with 200 status codes, a common failure mode that delivery-success-only scoring misses.
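A rough sketch of the check described above follows. The expected content type, the acknowledgment header name `X-Webhook-Ack`, the `ReceiverResponse` type, and the choice to score only over 2xx responses are all hypothetical; the text does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional

EXPECTED_CONTENT_TYPE = "application/json"  # assumption, not from the text
ACK_HEADER = "x-webhook-ack"                # hypothetical header name


@dataclass
class ReceiverResponse:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)


def is_recognized_ack(resp: ReceiverResponse) -> bool:
    """True when a response has the expected content type and ack header."""
    headers = {name.lower(): value for name, value in resp.headers.items()}
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = headers.get("content-type", "").split(";")[0].strip().lower()
    return media_type == EXPECTED_CONTENT_TYPE and ACK_HEADER in headers


def schema_compliance_score(
    responses: Iterable[ReceiverResponse],
) -> Optional[float]:
    """Percentage of 2xx responses with a recognized shape, or None if none."""
    successes = [r for r in responses if 200 <= r.status < 300]
    if not successes:
        return None
    recognized = sum(1 for r in successes if is_recognized_ack(r))
    return 100.0 * recognized / len(successes)
```

A receiver that answers 200 with a `text/html` error page counts as a success for delivery but as a miss here, which is the gap this dimension is meant to close.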

Configuration freshness reflects the age of the subscription configuration weighed against the rate of platform schema changes. A subscription created two years ago, on an API version that has since been superseded by three newer versions, scores lower than a subscription created last month. The scale is relative to age and version: brand-new subscriptions score 100, subscriptions one major version behind score 80, two versions behind score 60, and three or more score 40. The metric catches subscriptions that customers have set and forgotten and that may be missing features or affected by deprecated behavior.
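The version-based part of this scale is a simple lookup. The sketch below assumes the platform can already tell how many major versions a subscription is behind, and treats zero versions behind as equivalent to the "brand-new" case; the function name is illustrative.

```python
def freshness_score(major_versions_behind: int) -> int:
    """Score configuration freshness from the number of major versions behind."""
    if major_versions_behind < 0:
        raise ValueError("versions behind cannot be negative")
    # Assumption: a subscription on the current version scores like a new one.
    scores = {0: 100, 1: 80, 2: 60}
    # Three or more versions behind all share the lowest score.
    return scores.get(major_versions_behind, 40)
```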

The composite score and what it should default to

The composite score defaults to the minimum of the four dimension scores, not the average. The rationale is that a subscription with 100 on three dimensions and 20 on one is unhealthy, not "mostly healthy." Averaging would produce a misleading 80, which hides how badly the one dimension is failing and is unlikely to prompt the customer to look closer. The minimum-based composite makes degraded dimensions visible.
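The difference between the two composites is easy to see in code. This is a sketch with illustrative dimension keys, not the platform's actual field names.

```python
from statistics import mean
from typing import Mapping


def composite_score(dimensions: Mapping[str, float]) -> float:
    """Composite health is the worst dimension, not the average."""
    return min(dimensions.values())


def weakest_dimension(dimensions: Mapping[str, float]) -> str:
    """Name of the dimension that determines the composite."""
    return min(dimensions, key=dimensions.get)


example = {
    "delivery_success": 100,
    "response_latency": 100,
    "schema_compliance": 20,
    "configuration_freshness": 100,
}
average = mean(example.values())      # 80: looks only mildly degraded
composite = composite_score(example)  # 20: reflects the failing dimension
```

Returning the name of the weakest dimension alongside the composite lets the dashboard say which dimension to look at first.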

The dashboard surface presents the composite score prominently with a color-coded badge (green 90-100, yellow 70-90, red below 70), and clicking the badge expands to show the four dimension scores and the underlying metrics. The point is to give customers a single number to scan when reviewing many subscriptions, and a detailed breakdown when they want to understand a specific score.
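The badge mapping can be sketched as a small function. Assigning the shared boundary values (90 and 70) to the better color is an assumption.

```python
def badge_color(composite: float) -> str:
    """Map a composite score to the dashboard badge color."""
    # Assumption: 90 is green and 70 is yellow, since the ranges share endpoints.
    if composite >= 90:
        return "green"
    if composite >= 70:
        return "yellow"
    return "red"
```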

What scoring should and should not do

The score should be advisory, not prescriptive. The platform calculates the score and surfaces it; the customer decides whether to act. The score should not be enforced by the platform as a condition for delivery — automatically pausing or rate-limiting low-score subscriptions removes customer control and creates support tickets when the customer cannot figure out why their subscription is throttled.

The score should be stable over short timescales. A 30-day window smooths over individual failed deliveries and one-off receiver issues; a 24-hour window would react to every transient problem and produce a flickering score that customers learn to ignore. The 30-day window is long enough to be meaningful and short enough to reflect recent behavior.
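A small worked example shows the smoothing effect. The traffic numbers are invented for illustration: 100 attempts per day, with a single bad day on which half of them fail.

```python
from typing import List, Tuple


def success_rate(days: List[Tuple[int, int]]) -> float:
    """Percentage of successful attempts over (successes, attempts) buckets."""
    successes = sum(s for s, _ in days)
    attempts = sum(a for _, a in days)
    return 100.0 * successes / attempts


# 29 clean days, then one bad day where 50 of 100 attempts fail.
history = [(100, 100)] * 29 + [(50, 100)]

last_24h = success_rate(history[-1:])  # only the bad day
last_30d = success_rate(history)       # the full window
```

The 24-hour view drops to 50 on the bad day, while the 30-day view stays at about 98.3, so a one-off incident does not flip the score's band.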

The score should be available via the API, not just the dashboard. Customers with multiple subscriptions want to query the API for the list of subscriptions below a threshold so they can prioritize cleanup. The pattern is a health_score field on the subscription resource with the four dimension scores nested under health_breakdown.
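A sketch of what the resource and a client-side filter might look like follows. Only `health_score` and `health_breakdown` come from the text; the `id` values, the dimension key names, and the `needs_attention` helper are hypothetical.

```python
from typing import Any, Dict, List

# Illustrative subscription resources as a client might receive them.
subscriptions: List[Dict[str, Any]] = [
    {
        "id": "sub_a",  # hypothetical identifier
        "health_score": 62,
        "health_breakdown": {
            "delivery_success": 62,
            "response_latency": 94,
            "schema_compliance": 100,
            "configuration_freshness": 80,
        },
    },
    {"id": "sub_b", "health_score": 97, "health_breakdown": {}},
    {"id": "sub_c", "health_score": None, "health_breakdown": None},
]


def needs_attention(
    subs: List[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Scored subscriptions below the threshold, worst first."""
    scored = [s for s in subs if s["health_score"] is not None]
    below = [s for s in scored if s["health_score"] < threshold]
    return sorted(below, key=lambda s: s["health_score"])
```

Subscriptions with a null score are skipped rather than treated as zero, so quiet subscriptions do not crowd the cleanup list.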

The score should be backward-compatible with subscriptions that pre-date the feature. The metric for subscriptions without enough recent activity defaults to "insufficient data" with a null score rather than scoring them low. The pattern avoids penalizing legitimately quiet subscriptions and gives customers an indication that the score needs more data to be meaningful.
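The null-score behavior might look like the sketch below. The minimum attempt count of 20 and the `health_status` field are hypothetical; the text says only that the score is null and reported as insufficient data.

```python
from typing import Any, Dict, Mapping

MIN_ATTEMPTS = 20  # hypothetical cutoff for "enough recent activity"


def health_fields(
    attempts_in_window: int, dimensions: Mapping[str, float]
) -> Dict[str, Any]:
    """Build the health fields, returning a null score for quiet subscriptions."""
    if attempts_in_window < MIN_ATTEMPTS:
        return {
            "health_score": None,
            "health_status": "insufficient_data",  # hypothetical field
            "health_breakdown": None,
        }
    return {
        "health_score": min(dimensions.values()),
        "health_status": "scored",
        "health_breakdown": dict(dimensions),
    }
```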

Notification triggers

The score becomes useful when it triggers notifications proportional to the degradation. The pattern that works:

Composite score drops below 90: dashboard banner appears, no email.
Composite score drops below 70: weekly email to owner_email.
Composite score drops below 50: daily email to owner_email, dashboard banner becomes red.
Composite score at 0, or unhealthy across all dimensions: escalate to the automatic pause workflow that handles sustained failure.

The escalation thresholds give customers visibility before catastrophic failure and several chances to act before the platform takes automated action. The tiered notification matches the urgency of the degradation to the urgency of the notification channel.
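The tiers above can be expressed as a single decision function. This is a sketch: the return shape, the "standard" banner label, and the way per-dimension bands are passed in are assumptions.

```python
from typing import Any, Dict, Mapping


def notification_plan(
    composite: float, bands: Mapping[str, str]
) -> Dict[str, Any]:
    """Decide banner, email cadence, and escalation for a composite score."""
    all_unhealthy = bool(bands) and all(
        band == "unhealthy" for band in bands.values()
    )
    plan: Dict[str, Any] = {"banner": None, "email": None, "escalate": False}
    if composite < 90:
        plan["banner"] = "standard"  # banner only, no email yet
    if composite < 70:
        plan["email"] = "weekly"
    if composite < 50:
        plan["email"] = "daily"
        plan["banner"] = "red"
    if composite == 0 or all_unhealthy:
        plan["escalate"] = True      # hand off to the automatic pause workflow
    return plan
```

Each tier builds on the previous one, so a score of 45 produces both the red banner and the daily email, while escalation is reserved for the final case.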

Three patterns that fail

The first failed pattern is a single opaque score with no breakdown. The customer sees a score of 67 and has no idea what is wrong. Acting on it requires drilling into delivery logs, which the customer would already be doing if they had the time. The breakdown into named dimensions is what makes the score actionable.

The second failed pattern is averaging across dimensions. The customer with 100/100/100/30 is unhealthy on schema compliance, not "mostly healthy at 82.5." Averaging masks the dimension that needs attention.

The third failed pattern is enforced action based on score. Automatic throttling or pausing of low-score subscriptions creates a feedback loop where customers cannot improve the score because the platform is now suppressing the traffic that would generate score evidence. The advisory pattern preserves customer agency.

What we use across the four products

WebhookVault has the most complete implementation: all four dimensions, composite score, dashboard surface with detail expansion, and API field on subscriptions. The pattern was driven by customer support volume — webhook debugging is the product's primary use case, and "why are deliveries failing" is the most common question. The health score gives customers a self-service answer.

CronPing implements the delivery success and response latency dimensions on notification endpoints (the subscription analog for monitor alerts). Schema compliance and configuration freshness do not apply in the same way and are not computed.

FlagBit implements all four dimensions on its flag-change webhook subscriptions. The configuration freshness metric has been particularly useful because FlagBit's API has evolved across multiple versions and customers do not always realize their integration is on an older version.

DocuMint does not have outbound subscriptions and so does not need the feature on the outbound side. The Stripe webhook receiver on the inbound side is monitored separately via the same general approach but with metrics specific to Stripe's response shape.

What this earns over time

The health score is one of those features that is invisible when it works and obviously valuable when it surfaces a problem the customer would not otherwise have noticed. The support ticket pattern shifts from "my integration is broken, why did the platform not tell me" to "I see the score dropped, what do I do." The second question has answers; the first one mostly does not.

The deeper pattern is that webhook subscriptions are the kind of integration that customers set and forget, and the platform's job in that asymmetric relationship is to surface degradation early and give customers something to act on. The health score is a small data structure that does this work consistently across the lifetime of the integration, without requiring customers to instrument their own monitoring of the platform's behavior.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.
