
Designing Multi-Tenant Rate Limits: Per-Tenant, Per-Endpoint, and the Patterns That Scale

A single global rate limit lets one noisy tenant slow down everyone else. A per-tenant rate limit lets one expensive endpoint slow down everything that tenant does. The right design is a small matrix of limits at different scopes, and the wrong design is to keep adding global limits one at a time.

Anethoth

13 May 2026

The first rate limit a team adds is usually a global one: 1000 requests per minute per API key, total. It catches the runaway script that someone forgot to put a sleep in, and it covers most of the cases the team has thought about. It is also the wrong shape, and the reason becomes obvious the first time a single tenant decides to backfill a year of data through your API while ten other customers are trying to use it normally.

The right shape for rate limits in a multi-tenant SaaS is a small matrix: limits at different scopes, applied independently, with the most restrictive winning. We have iterated on this pattern across DocuMint, CronPing, FlagBit, and WebhookVault, and the shape that ages well is the same across all four despite very different traffic patterns.

The four scopes that matter

The first scope is per-tenant total throughput. This is the limit you put on the customer's plan: the Pro tier gets 60 requests per second, the Business tier gets 300, the Free tier gets 5. Its job is fairness between customers, not protection of the system. If one customer is at their plan limit, that is by design; if every customer is at their plan limit simultaneously, that is a capacity problem you need to solve at the system level.

The second scope is per-endpoint per tenant. Expensive endpoints get tighter limits than cheap ones. A PDF generation endpoint that takes 500ms of CPU per request needs a much tighter limit than a status check that takes 2ms. Setting the same per-tenant limit on both means a customer can either DoS your PDF endpoint or be artificially limited on the cheap one. The per-endpoint scope decouples them.

The third scope is per-resource. If a customer can address resources by ID, you almost always need a per-resource limit to prevent one resource from monopolizing the tenant's throughput. A webhook endpoint being hammered should not block the tenant's other webhook endpoints from being managed. A feature flag being evaluated should not block the tenant's other flags. The per-resource limit lives below the per-tenant limit and is usually much smaller.

The fourth scope is system-global. This is the protective backstop: regardless of who is making the requests or what endpoint they are hitting, the system as a whole will not accept more than N requests per second. Its job is to keep the database, the worker pool, and the upstream dependencies from being overwhelmed. It should almost never fire; if it fires regularly, you need more capacity, not tighter limits.
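One way to picture the four scopes is as a small list of limits built per request. The sketch below is illustrative only: the plan numbers come from the text above, while the endpoint paths, the per-endpoint, per-resource, and global values, the `rl:` key format, and the names `Limit` and `limits_for` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Limit:
    scope: str         # "tenant", "endpoint", "resource", or "global"
    key: str           # counter key in the shared store (hypothetical format)
    max_requests: int  # allowed requests per window
    window_s: int      # window length in seconds


# Plan limits quoted in the text (requests per second).
PLAN_RPS = {"free": 5, "pro": 60, "business": 300}

# Hypothetical values, chosen only to make the example concrete.
ENDPOINT_RPS = {"/pdf": 2, "/status": 50}
RESOURCE_RPS = 1
GLOBAL_RPS = 5000


def limits_for(tenant: str, plan: str, endpoint: str,
               resource_id: Optional[str]) -> List[Limit]:
    """Return every limit that applies to one request."""
    limits = [
        Limit("tenant", f"rl:tenant:{tenant}", PLAN_RPS[plan], 1),
        Limit("global", "rl:global", GLOBAL_RPS, 1),
    ]
    if endpoint in ENDPOINT_RPS:
        limits.append(Limit("endpoint", f"rl:endpoint:{tenant}:{endpoint}",
                            ENDPOINT_RPS[endpoint], 1))
    if resource_id is not None:
        limits.append(Limit("resource", f"rl:resource:{tenant}:{resource_id}",
                            RESOURCE_RPS, 1))
    return limits
```

A request from a Pro tenant to the hypothetical `/pdf` endpoint for one resource yields four limits; a request to an endpoint with no specific limit and no resource ID yields only the tenant and global ones.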

The decision algorithm

The decision is straightforward: the most restrictive applicable limit wins. For each request, check the per-tenant limit, the per-endpoint-per-tenant limit, the per-resource limit, and the system-global limit. If any one is exceeded, reject the request with 429 and an honest Retry-After. The headers should tell the client which limit fired, because the right remediation differs ("you are over your plan limit, consider upgrading" is a different message from "this specific resource is hot, try again in a few seconds").
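The rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Check` and `Decision` types are hypothetical, and reporting the exceeded limit with the longest wait as the one that "fired" is an assumption about what makes Retry-After honest when several limits are exceeded at once.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Check:
    scope: str         # which scope this limit belongs to
    limit: int         # allowed requests in the window
    used: int          # requests already counted in the window
    reset_in_s: float  # seconds until this limit has room again


@dataclass
class Decision:
    allowed: bool
    status: int
    fired_scope: Optional[str] = None
    retry_after_s: Optional[int] = None


def decide(checks: List[Check]) -> Decision:
    """Allow only if no applicable limit is exceeded."""
    exceeded = [c for c in checks if c.used >= c.limit]
    if not exceeded:
        return Decision(allowed=True, status=200)
    # Assumption: when several limits fire, the client has to wait for the
    # slowest one to clear, so that one drives Retry-After and the scope.
    worst = max(exceeded, key=lambda c: c.reset_in_s)
    return Decision(False, 429, worst.scope, math.ceil(worst.reset_in_s))
```

With an endpoint limit that clears in 0.4 seconds and a resource limit that clears in 2.3 seconds both exceeded, this returns a 429 attributed to the resource scope with a Retry-After of 3.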

The implementation of each individual limit is the standard sliding-window counter against a shared store. The non-obvious part is the headers: we return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset for the limit that is closest to firing, and X-RateLimit-Scope identifying which scope it belongs to. When multiple limits are close, this disambiguates what the client should do.
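As a rough sketch of both halves of this paragraph: the class below is one common form of the sliding-window counter (two fixed windows, with the previous one weighted by its remaining overlap), using an in-memory dict as a stand-in for the shared store. The header helper assumes that "closest to firing" means the smallest fraction of the limit remaining; that interpretation, and the names `SlidingWindowCounter` and `rate_limit_headers`, are assumptions rather than the production code.

```python
import math
from typing import Dict, Tuple


class SlidingWindowCounter:
    """Approximate sliding window over two adjacent fixed windows."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        # (key, window index) -> count; a stand-in for the shared store.
        self.counts: Dict[Tuple[str, int], int] = {}

    def _estimate(self, key: str, now: float) -> float:
        idx = int(now // self.window_s)
        elapsed_fraction = (now % self.window_s) / self.window_s
        previous = self.counts.get((key, idx - 1), 0)
        current = self.counts.get((key, idx), 0)
        # Only the part of the previous window still inside the sliding
        # window contributes to the estimate.
        return previous * (1.0 - elapsed_fraction) + current

    def hit(self, key: str, now: float) -> bool:
        """Count the request and return True, or return False if over limit."""
        if self._estimate(key, now) >= self.limit:
            return False
        idx = int(now // self.window_s)
        self.counts[(key, idx)] = self.counts.get((key, idx), 0) + 1
        return True

    def remaining(self, key: str, now: float) -> int:
        return max(0, math.floor(self.limit - self._estimate(key, now)))


def rate_limit_headers(states: Dict[str, Tuple[int, int, int]]) -> Dict[str, str]:
    """states maps scope -> (limit, remaining, reset_epoch_s); limits must be > 0."""
    # Assumption: "closest to firing" = smallest fraction of the limit left.
    scope = min(states, key=lambda s: states[s][1] / states[s][0])
    limit, remaining, reset = states[scope]
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset),
        "X-RateLimit-Scope": scope,
    }
```

A real deployment would do the read and increment atomically in the store and expire old windows; both are omitted here to keep the shape of the technique visible.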

The cost-vs-count question

Counting requests is the right metric for some endpoints and the wrong metric for others. A status check and a bulk export both count as one request, but their cost differs by three orders of magnitude. For endpoints with highly variable cost, a request-count limit is fairness theater: a customer can use 1000x the resources of another customer while staying under the same request budget.

The right answer for variable-cost endpoints is a credit-based limit. Each request debits an estimated cost from a per-tenant credit balance, the balance refills at a configured rate, and the limit is enforced against credits rather than requests. The cost can be a constant per endpoint (cheap requests cost 1, bulk requests cost 50), or it can be measured after the fact (the CPU milliseconds the request actually consumed are debited once it completes).
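A minimal sketch of the constant-per-endpoint variant, assuming a simple refill-on-read bucket. The costs 1 and 50 are the ones from the text; the endpoint paths, the capacity and refill numbers, and the `CreditBucket` name are hypothetical.

```python
from dataclasses import dataclass

# Costs from the example in the text; the endpoint paths are hypothetical.
ENDPOINT_COST = {"/status": 1, "/export": 50}


@dataclass
class CreditBucket:
    capacity: float      # maximum credits a tenant can hold
    refill_per_s: float  # credits restored per second
    balance: float       # current credits
    updated_at: float    # timestamp of the last refill, in seconds

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.balance = min(self.capacity,
                           self.balance + elapsed * self.refill_per_s)
        self.updated_at = now

    def try_debit(self, cost: float, now: float) -> bool:
        """Debit the cost if the tenant can afford it; otherwise reject."""
        self._refill(now)
        if cost > self.balance:
            return False
        self.balance -= cost
        return True
```

With a capacity of 100 credits, a tenant can make two bulk requests back to back or one hundred cheap ones, which is the fairness property a plain request count cannot express.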

We use the constant-per-endpoint approach for most endpoints across our products, because the cost variance within a single endpoint is usually small enough to ignore. For bulk endpoints whose cost depends on the size of the input, the after-the-fact debit pattern is more accurate but more operationally complex, so we use it only on the bulk endpoints where it matters.
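The after-the-fact pattern can be sketched as "admit while the balance is positive, then debit what the request actually used". Everything here is an assumption for illustration: the `PostpaidCredits`, `RateLimited`, and `run_metered` names, the choice to let the balance go negative, and the use of `time.process_time()` as the CPU measurement.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised when the tenant has no credit left; maps to a 429."""


class PostpaidCredits:
    """Credits are CPU milliseconds; a costly request can overdraw the balance."""

    def __init__(self, capacity_ms: float, refill_ms_per_s: float, now: float) -> None:
        self.capacity_ms = capacity_ms
        self.refill_ms_per_s = refill_ms_per_s
        self.balance_ms = capacity_ms
        self.updated_at = now

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.balance_ms = min(self.capacity_ms,
                              self.balance_ms + elapsed * self.refill_ms_per_s)
        self.updated_at = now

    def admit(self, now: float) -> bool:
        """Admit while the balance is positive; the cost is not known yet."""
        self._refill(now)
        return self.balance_ms > 0

    def settle(self, cost_ms: float, now: float) -> None:
        """Debit the measured cost once the request has finished."""
        self._refill(now)
        self.balance_ms -= cost_ms


def run_metered(credits: PostpaidCredits, work: Callable[[], T],
                clock: Callable[[], float] = time.monotonic) -> T:
    if not credits.admit(clock()):
        raise RateLimited()
    started = time.process_time()
    try:
        return work()
    finally:
        # Debit even when the work raises: the CPU time was still spent.
        cost_ms = (time.process_time() - started) * 1000.0
        credits.settle(cost_ms, clock())
```

The operational complexity mentioned above shows up in the overdraft: one very large request is always admitted, and the tenant then waits for the balance to climb back above zero.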

The fail-open question

The store that holds the rate-limit counters can fail. The question every multi-tenant rate limiter has to answer is: when the store is unreachable, do you fail open (allow the request) or fail closed (reject it)?

Fail-open is the safer default for SaaS where the rate limit is fairness rather than security. A brief Redis outage that causes ten minutes of unlimited traffic is recoverable; a brief Redis outage that causes ten minutes of rejected requests is a customer-visible incident. Failing closed only makes sense when the rate limit is protecting something that cannot tolerate the worst-case unmetered traffic, which is rare in B2B SaaS.

An honest fail-open implementation should distinguish between "store is genuinely unreachable" and "this specific key is not in the store yet". The latter should be treated as a fresh window starting now; the former should bypass the limit entirely. Mixing them up produces hard-to-debug behavior in which rate limits appear to relax at random under load.
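The distinction can be made explicit in code by giving the two cases different types: a missing key is a normal return value, while an unreachable store is an exception. The sketch below uses a hypothetical in-memory store with a switch to simulate an outage; window expiry is left out, and the names `StoreUnavailable`, `InMemoryStore`, and `allow` are assumptions.

```python
from typing import Dict, Optional, Tuple


class StoreUnavailable(Exception):
    """The counter store could not be reached at all."""


class InMemoryStore:
    """Stand-in for a shared store; set up=False to simulate an outage."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.up = True

    def get(self, key: str) -> Optional[int]:
        if not self.up:
            raise StoreUnavailable()
        return self.data.get(key)  # None means "key not in the store yet"

    def incr(self, key: str) -> int:
        if not self.up:
            raise StoreUnavailable()
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]


def allow(store: InMemoryStore, key: str, limit: int) -> Tuple[bool, str]:
    """Return (allowed, reason) so the two fail-open paths stay distinguishable."""
    try:
        count = store.get(key)
        if count is None:
            count = 0  # missing key: a fresh window starting now, still enforced
        if count >= limit:
            return False, "limited"
        store.incr(key)
        return True, "counted"
    except StoreUnavailable:
        # Store unreachable: bypass the limit entirely.
        return True, "fail_open"
```

Returning the reason alongside the decision makes it straightforward to count fail-open bypasses separately from ordinary allowed requests.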

The operational signals

There are three operational signals to monitor on a multi-tenant rate limiter. First, the 429 rate per tenant, which lets you spot abusive customers and customers about to churn (a customer whose 429 rate is climbing is a customer about to file a ticket). Second, the 429 rate per scope, which tells you whether your limits are calibrated correctly: if the system-global limit fires more often than the per-tenant limits, your global limit is too tight or your tenant limits are too loose. Third, the store latency, because rate-limit checks are on the hot path and a slow store adds latency to every request.
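A small in-process sketch of the three signals, assuming they are recorded at the point where the limiter makes its decision. The `LimiterMetrics` class and its method names are hypothetical, and a real system would export these to a metrics backend rather than hold them in memory.

```python
from collections import Counter
from typing import List


class LimiterMetrics:
    def __init__(self) -> None:
        self.rejections_by_tenant: Counter = Counter()
        self.rejections_by_scope: Counter = Counter()
        self.store_latencies_ms: List[float] = []

    def record_rejection(self, tenant: str, scope: str) -> None:
        """Call once per 429, with the scope that fired."""
        self.rejections_by_tenant[tenant] += 1
        self.rejections_by_scope[scope] += 1

    def record_store_latency(self, latency_ms: float) -> None:
        self.store_latencies_ms.append(latency_ms)

    def global_fires_more_than_tenant(self) -> bool:
        """The calibration warning described in the text."""
        return (self.rejections_by_scope["global"]
                > self.rejections_by_scope["tenant"])

    def store_latency_p99_ms(self) -> float:
        """Nearest-rank 99th percentile of recorded store latencies."""
        ordered = sorted(self.store_latencies_ms)
        if not ordered:
            return 0.0
        rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer math
        return ordered[rank - 1]
```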

The deeper observation

The right number of rate limits is not zero and is not "as many as we can think of." It is the small matrix of scopes that maps to the actual ways requests can hurt the system and the actual ways customers can fail to share fairly. A team that keeps adding global limits one at a time ends up with a rate limiter that is hard to reason about and that fires for unexpected reasons; a team that designs the scope matrix up front ends up with limits that customers can understand and that operators can tune. The right shape is small and predictable, not exhaustive.
