Designing API Webhook Delivery Partitioning: How to Scale Per-Tenant Without Cross-Tenant Interference

A webhook delivery system that works at hundreds of deliveries per second eventually meets a customer producing tens of thousands of events per second. At that point total delivery throughput stops being the binding constraint and per-tenant isolation becomes the problem to solve. The architectural answer is partitioning, but partitioning decisions compound and are hard to reverse. The patterns that survive are worth knowing before the first scaling incident forces a redesign.

The interference problem

A naive webhook delivery system maintains a shared queue of pending deliveries and a pool of worker processes that pull from the queue and POST to receiver URLs. The arrangement works well when delivery volumes are roughly comparable across tenants. The arrangement fails when one tenant suddenly produces orders of magnitude more deliveries than typical.

The failure mode is head-of-line blocking. The shared queue fills with the burst tenant's deliveries, and other tenants' deliveries wait behind the burst. The wait time for non-burst tenants grows from milliseconds to seconds to minutes as the queue depth grows. The non-burst tenants see degraded delivery latency through no action of their own. This is the cross-tenant interference that partitioning is designed to prevent.

The failure mode is worse when the burst tenant's receiver is slow. Each delivery attempt occupies a worker for the duration of the HTTP request to the receiver. A receiver that takes 30 seconds to respond ties up a worker for 30 seconds per attempt. The worker pool is sized for typical receivers responding in under a second. A burst tenant with a slow receiver can saturate the worker pool within minutes and leave no workers available for any other tenant.
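The effect can be reproduced with a toy model. The sketch below is illustrative only: it assumes every delivery is already enqueued at time zero, a fixed pool of workers that each handle one request at a time, and made-up tenant names and timings. It computes when each delivery in a shared FIFO queue would start.

```python
import heapq

def fifo_start_times(jobs, workers):
    """Return (tenant, start_time) for jobs served in FIFO order.

    jobs is a list of (tenant, service_seconds); all jobs are assumed
    to be enqueued at t=0. Each worker serves one job at a time.
    """
    free_at = [0.0] * workers          # time at which each worker is next free
    heapq.heapify(free_at)
    starts = []
    for tenant, service_seconds in jobs:
        start = heapq.heappop(free_at)  # earliest available worker
        starts.append((tenant, start))
        heapq.heappush(free_at, start + service_seconds)
    return starts

# Hypothetical numbers: 40 deliveries to a 30-second receiver, then one
# delivery for another tenant whose receiver answers in 100ms.
burst = [("tenant-a", 30.0)] * 40
quiet = [("tenant-b", 0.1)]

print(fifo_start_times(burst + quiet, workers=4)[-1])  # ('tenant-b', 300.0)
print(fifo_start_times(quiet, workers=4)[-1])          # ('tenant-b', 0.0)
```

With four workers, tenant-b's single fast delivery waits five minutes behind forty slow ones, even though tenant-b did nothing unusual.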

The partitioning question

The partitioning question is: what is the unit of isolation? The three candidates are per-tenant, per-subscription, and per-receiver-URL. Each has different operational and cost characteristics.

Per-tenant partitioning isolates customers from each other. A burst from tenant A does not affect deliveries for tenant B. The implementation is straightforward: each tenant has its own queue and its own worker pool. The cost is operational complexity: the system has thousands of queues rather than one, the worker pool sizing per tenant is a non-trivial question, and the cross-tenant resource sharing during quiet periods is awkward.

Per-subscription partitioning isolates webhooks within a tenant. A subscription that is slow does not affect other subscriptions for the same tenant. The implementation is finer-grained than per-tenant: each subscription has its own queue. The cost is even higher operational complexity. The benefit is that customers with multiple subscriptions get internal isolation, which matters when one subscription is a test endpoint and another is production.

Per-receiver-URL partitioning isolates around the receiving server. A slow receiver does not block deliveries to other receivers. The implementation matches the actual failure mode of slow downstream services. The cost is that the partitioning unit is determined by customer URL choices rather than by tenant identity, which produces unpredictable resource usage.
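The three candidates differ only in which key a delivery is grouped by. The following sketch makes that concrete. The Delivery fields, the key prefixes, and the choice to key receivers by scheme and host rather than by full URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

@dataclass(frozen=True)
class Delivery:
    tenant_id: str
    subscription_id: str
    receiver_url: str

def partition_key(delivery: Delivery, unit: str) -> str:
    """Map a delivery to its isolation unit: tenant, subscription, or receiver."""
    if unit == "tenant":
        return f"tenant:{delivery.tenant_id}"
    if unit == "subscription":
        return f"sub:{delivery.tenant_id}:{delivery.subscription_id}"
    if unit == "receiver":
        # Assumption: deliveries to the same host share a fate, so the
        # path and query string are ignored.
        parts = urlsplit(delivery.receiver_url)
        return f"recv:{parts.scheme}://{parts.netloc.lower()}"
    raise ValueError(f"unknown isolation unit: {unit}")

a = Delivery("tenant-a", "sub-1", "https://hooks.example.com/a")
b = Delivery("tenant-b", "sub-9", "https://Hooks.Example.com/b")

print(partition_key(a, "tenant"), partition_key(b, "tenant"))
print(partition_key(a, "receiver"), partition_key(b, "receiver"))
```

Two tenants that point at the same receiving host land in different per-tenant partitions but in the same per-receiver partition, which is the sense in which the receiver unit is decided by customer URL choices rather than by tenant identity.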

The hybrid pattern

Most production webhook systems converge on a hybrid pattern. The top-level partitioning is per-tenant for billing isolation and per-customer SLA enforcement. Within each tenant, deliveries are further partitioned per-subscription. The worker pool is shared across tenants with per-tenant rate limits that prevent any single tenant from monopolizing the pool.

The implementation is a single logical queue with per-tenant and per-subscription sub-queues plus a scheduler that pulls from sub-queues in a fair-share order. The scheduler is the key component: it implements the policy that ensures cross-tenant fairness and within-tenant isolation. The most common scheduler implementations are weighted round-robin and deficit round-robin, both of which give each tenant a guaranteed minimum share of the worker pool.
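A minimal sketch of a deficit round-robin scheduler over per-tenant sub-queues follows. It is not a production implementation: the class and method names are hypothetical, every delivery is assumed to cost one unit unless stated otherwise, and the per-subscription level is omitted to keep the example short.

```python
from collections import deque

class DeficitRoundRobin:
    """Fair-share scheduler: each tenant may dispatch up to its quantum per round."""

    def __init__(self, default_quantum=1):
        self.default_quantum = default_quantum
        self.quantum = {}   # tenant -> per-round allowance (the tenant's weight)
        self.deficit = {}   # tenant -> unused allowance carried between rounds
        self.queues = {}    # tenant -> deque of (cost, item)

    def enqueue(self, tenant, item, cost=1):
        self.queues.setdefault(tenant, deque()).append((cost, item))
        self.deficit.setdefault(tenant, 0)

    def next_round(self):
        """Visit every backlogged tenant once and return the dispatched (tenant, item) pairs."""
        dispatched = []
        for tenant in list(self.queues):
            queue = self.queues[tenant]
            self.deficit[tenant] += self.quantum.get(tenant, self.default_quantum)
            while queue and queue[0][0] <= self.deficit[tenant]:
                cost, item = queue.popleft()
                self.deficit[tenant] -= cost
                dispatched.append((tenant, item))
            if not queue:
                # An idle tenant does not bank allowance for later bursts.
                self.deficit[tenant] = 0
                del self.queues[tenant]
        return dispatched

drr = DeficitRoundRobin(default_quantum=2)
for i in range(100):
    drr.enqueue("burst", i)
drr.enqueue("quiet", "a")
drr.enqueue("quiet", "b")

print(drr.next_round())  # [('burst', 0), ('burst', 1), ('quiet', 'a'), ('quiet', 'b')]
print(drr.next_round())  # [('burst', 2), ('burst', 3)]
```

The quiet tenant's two deliveries go out in the first round regardless of how deep the burst tenant's backlog is. Raising a tenant's quantum is the weighted variant: it increases that tenant's share without changing the guarantee for anyone else.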

The per-tenant rate limit is the safety valve. A tenant that exceeds its allocated share has its deliveries throttled to the share level. The throttling produces an internal queue that fills until the burst subsides. The non-burst tenants see no impact because their share is protected by the scheduler.
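One common way to implement such a limit is a token bucket per tenant. The sketch below assumes that technique, since the text does not name one. The class name, the rate and burst parameters, and the injectable clock are illustrative choices.

```python
import time

class TokenBucket:
    """Per-tenant limiter: refills at rate_per_second, holds at most burst tokens."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable so the limiter can be tested
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Take cost tokens if available; otherwise leave the bucket unchanged."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Fake clock so the example is deterministic.
now = [0.0]
bucket = TokenBucket(rate_per_second=10, burst=5, clock=lambda: now[0])

print([bucket.try_acquire() for _ in range(6)])  # five True, then False
now[0] = 0.5                                     # half a second later
print(bucket.try_acquire())                      # True
```

A delivery that fails try_acquire stays in the tenant's own sub-queue, which is the internal queue that fills until the burst subsides.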

Worker pool sizing

The worker pool sizing question is harder than it appears. The naive answer is to size for peak load: enough workers to handle the maximum delivery rate without queue buildup. That answer overprovisions during quiet periods and is expensive at large scale.

The right answer is to size for the desired steady-state latency and provide elastic scaling above the baseline. The baseline pool handles the median delivery rate at the target latency. The elastic capacity handles bursts up to a configured cap. The cap exists because elastic capacity is more expensive per unit and unlimited elasticity is a budget risk.

The sizing depends on the receiver response time distribution. A worker that makes one blocking HTTP request at a time is tied up for the duration of that request. A receiver responding in 100ms supports 10 deliveries per worker per second. A receiver responding in 1 second supports 1 delivery per worker per second. This order-of-magnitude difference dominates worker pool sizing, which is why receiver response time is among the metrics that webhook providers monitor most carefully.
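The arithmetic is Little's law: the average number of busy workers equals the delivery rate multiplied by the time each delivery holds a worker. A rough sizing helper might look like the following. The function name and the target utilization parameter, which leaves headroom so queues do not build at the baseline, are assumptions.

```python
import math

def workers_needed(deliveries_per_second, mean_response_seconds, target_utilization=0.5):
    """Estimate pool size for one-request-at-a-time workers via Little's law."""
    busy_workers = deliveries_per_second * mean_response_seconds
    # Round before ceil so float noise (e.g. 100.00000000000001) does not add a worker.
    return math.ceil(round(busy_workers / target_utilization, 9))

print(workers_needed(500, 0.1))  # 100
print(workers_needed(500, 1.0))  # 1000
```

The same delivery rate needs ten times the workers when the mean response time moves from 100ms to 1 second. The mean is the relevant statistic for the baseline, while the tail of the distribution drives how much elastic capacity the bursts consume.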

Per-subscription concurrency limits

Per-subscription concurrency limits cap the number of in-flight deliveries to a given subscription. The cap exists for two reasons. The first is receiver-side rate limiting: most receivers cannot handle unbounded concurrent inbound requests and cap themselves implicitly via connection limits or rate limits. The second is provider-side protection: a subscription with a slow receiver should not be allowed to tie up unbounded worker resources.

The default concurrency limit varies by provider, and many providers do not publish it. Published ordering guarantees are a better guide to what a receiver can assume: Stripe's documentation, for example, states that events are not guaranteed to be delivered in the order in which they were generated. Most providers appear to allow a few concurrent connections per endpoint. The right default for a B2B SaaS webhook system is between 1 and 5, with customer-controlled adjustment available for receivers that can handle more.

The interaction with ordering is important. A concurrency limit of 1 produces strictly serialized delivery within the scope of the limit. A concurrency limit above 1 allows deliveries to overlap, so a later event can reach the receiver before an earlier one. Ordering guarantees and concurrency limits are therefore coupled: a provider that promises per-resource ordering must enforce concurrency 1 within each resource.
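A per-subscription cap can be sketched with one semaphore per subscription. The example assumes an asyncio-based delivery worker, which the text does not specify. The SubscriptionLimiter class and the stand-in send coroutine are hypothetical, and a real sender would make the HTTP request where the sleep is.

```python
import asyncio

class SubscriptionLimiter:
    """Caps in-flight deliveries per subscription with one semaphore each."""

    def __init__(self, default_limit=1):
        self.default_limit = default_limit
        self._semaphores = {}

    def _semaphore(self, subscription_id):
        if subscription_id not in self._semaphores:
            self._semaphores[subscription_id] = asyncio.Semaphore(self.default_limit)
        return self._semaphores[subscription_id]

    async def deliver(self, subscription_id, send):
        async with self._semaphore(subscription_id):
            return await send()

async def peak_in_flight(limit, deliveries=5):
    """Run several deliveries to one subscription and report the peak concurrency."""
    limiter = SubscriptionLimiter(default_limit=limit)
    in_flight = 0
    peak = 0

    async def send():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)   # stand-in for the HTTP request
        in_flight -= 1

    await asyncio.gather(*(limiter.deliver("sub-1", send) for _ in range(deliveries)))
    return peak

print(asyncio.run(peak_in_flight(limit=1)))  # 1
print(asyncio.run(peak_in_flight(limit=3)))  # 3
```

With a limit of 1 no two deliveries to the subscription overlap, which is the precondition for serialized delivery. To promise per-resource ordering, the semaphore would be keyed by resource rather than by subscription.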

The hot-tenant problem

The hot-tenant problem is the case where one tenant's delivery volume is high enough that it does not fit cleanly within a per-tenant allocation. A tenant producing tens of thousands of events per second is a candidate for special handling rather than the default per-tenant share.

The first option is to increase the per-tenant allocation for the hot tenant. The allocation is a configuration value that can be set per-tenant. The increase is operationally simple but the allocation must be paid for by reducing allocations elsewhere or by adding capacity. The accounting is usually a per-tier feature where higher-tier customers get larger allocations.

The second option is to spread the hot tenant's deliveries across multiple shards. The implementation requires partitioning the per-tenant queue across multiple physical queues with a consistent-hash routing on subscription ID or event ID. The sharding adds operational complexity but allows the tenant's delivery volume to scale horizontally beyond the capacity of a single queue.
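A minimal consistent-hash ring for routing a hot tenant's deliveries across several physical queues might look like the following. The queue names, the virtual node count, and the use of SHA-256 are illustrative assumptions rather than a recommendation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=64):
        # Each shard owns vnodes points on the ring to even out the load.
        self._ring = sorted((_hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]

three = HashRing(["q1", "q2", "q3"])
four = HashRing(["q1", "q2", "q3", "q4"])

subscriptions = [f"sub-{i}" for i in range(500)]
moved = [s for s in subscriptions if three.shard_for(s) != four.shard_for(s)]
print(len(moved), {four.shard_for(s) for s in moved})
```

Adding a fourth queue moves only the keys that the new queue takes over, and every other key stays where it was. Routing on subscription ID keeps each subscription's deliveries on one queue, which preserves any per-subscription ordering. Routing on event ID spreads load more evenly but gives that up.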

The third option is to give the hot tenant a dedicated infrastructure tier. The implementation runs a separate delivery pipeline for the hot tenant with its own queue, worker pool, and scheduler. The isolation is total but the cost is high and the operational complexity is substantial. The tier exists in some enterprise webhook offerings as a premium-priced option.

The geographic partitioning question

Geographic partitioning is the question of whether to run delivery infrastructure in multiple regions. The benefit is reduced delivery latency for receivers in regions other than the primary region. The cost is operational complexity of running multi-region infrastructure.

The benefit is largest when the receiver population is geographically distributed and the typical receiver response time is dominated by network round-trip rather than receiver processing time. A receiver in Europe receiving deliveries from a North American queue sees 100-200ms of round-trip time per delivery. The same delivery from a European queue sees 10-20ms of round-trip time. The 10x improvement matters at scale.

The cost is operational. Multi-region delivery requires per-region queue infrastructure, cross-region replication of subscription metadata, and a routing layer that places each delivery in the right region. The implementation is substantially harder than single-region delivery and the operational surface is correspondingly larger. The right time to add geographic partitioning is after the latency improvement justifies the operational cost, which is typically at scales of millions of deliveries per day rather than thousands.

Three patterns that fail

The first pattern that fails is partitioning by random hash of event ID. The pattern distributes load evenly across shards but breaks per-tenant isolation: a burst tenant's deliveries land in all shards and degrade all shards equally. The right pattern is partitioning by tenant or subscription, which keeps interference local.
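The difference in blast radius is easy to demonstrate. The sketch below uses plain modulo hashing, eight shards, and made-up identifiers, and counts how many shards a single tenant's burst touches under each keying.

```python
import hashlib

def shard_of(key: str, shards: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shards

SHARDS = 8
burst_events = [("tenant-a", f"evt-{i}") for i in range(1000)]

# Shards touched when keyed by event ID versus by tenant ID.
by_event = {shard_of(event_id, SHARDS) for _, event_id in burst_events}
by_tenant = {shard_of(tenant_id, SHARDS) for tenant_id, _ in burst_events}

print(len(by_event), len(by_tenant))
```

Keyed by event ID, the burst reaches every shard. Keyed by tenant, it stays on one shard, and tenants on the other seven are unaffected.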

The second pattern that fails is uncapped per-tenant scaling. A tenant that produces an unexpected burst should be throttled at the per-tenant cap rather than allowed to consume unbounded resources. The right pattern is explicit per-tenant caps with customer-visible warnings when the cap is approached and customer-visible errors when the cap is hit.
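The warn-then-throttle behavior reduces to a small classification function. The status names and the 80 percent warning threshold in this sketch are assumptions. The text only requires that a warning precede the hard cap.

```python
from enum import Enum

class CapStatus(Enum):
    OK = "ok"
    WARNING = "warning"      # surface a customer-visible warning
    THROTTLED = "throttled"  # surface a customer-visible error

def cap_status(in_flight: int, cap: int, warn_percent: int = 80) -> CapStatus:
    """Classify a tenant's current usage against its explicit cap."""
    if in_flight >= cap:
        return CapStatus.THROTTLED
    # Integer arithmetic avoids float rounding at the threshold.
    if in_flight * 100 >= warn_percent * cap:
        return CapStatus.WARNING
    return CapStatus.OK

print(cap_status(5, 10))   # CapStatus.OK
print(cap_status(8, 10))   # CapStatus.WARNING
print(cap_status(10, 10))  # CapStatus.THROTTLED
```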

The third pattern that fails is rebalancing during incidents. A burst tenant during an incident is the worst time to rebalance partition assignments because rebalancing typically requires draining in-flight work and adding load to the already-stressed system. The right pattern is fixed partition assignments with manual rebalancing during quiet periods only.

Our use across the four products

Our four products implement different webhook delivery patterns reflecting different scale profiles. DocuMint receives Stripe webhooks rather than sending them and so does not participate in delivery partitioning. CronPing, FlagBit, and WebhookVault all send webhooks and share the delivery infrastructure across the three products.

The current architecture is a single shared worker pool with per-tenant fair-share scheduling. Each tenant has a default allocation of 10 concurrent deliveries with customer-configurable adjustment up to a tier-dependent cap. The fair-share scheduler is a deficit round-robin implementation that gives each tenant their allocation per scheduling round.

The peak delivery volume across the three products is hundreds of deliveries per second, which fits within a single-region, single-pool architecture. The next scaling step is per-tenant queue partitioning with a shared worker pool, which we plan to implement before delivery volumes reach the low thousands per second. Geographic partitioning is not planned until volumes reach the low tens of thousands per second and the receiver distribution justifies the operational complexity.

The deeper observation is that webhook delivery partitioning is one of the architectural decisions where the wrong call at small scale produces a system that becomes very expensive to restructure at large scale. The right call is to design for per-tenant isolation from the start even if the implementation initially runs everything in a single physical queue. The logical architecture matters more than the physical architecture for future scaling because the logical architecture is what customers experience and what the operational team must reason about during incidents.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.