API Throttling vs Rate Limiting: Two Patterns Often Confused

Throttling and rate limiting look similar from the outside but solve different problems. Confusing them produces APIs that are technically protected but practically unusable, or APIs that look fair but quietly let one customer take down a shared dependency.

In casual conversation the two terms are used interchangeably, and that carries over into design discussions. Rate limiting is a contractual mechanism: it caps how many requests a caller can make in a window, and the cap is part of the API's stated behavior. Throttling is an operational mechanism: it slows down or rejects requests when the system is under stress, regardless of which caller is making them. The two patterns coexist in a healthy API, but they answer different questions and fail in different ways when one is mistaken for the other.

The patterns in this post apply across the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — and to any HTTP API that has more than one customer.

Rate limiting: the contract

Rate limiting is what the API documentation tells callers about. "Free tier: 60 requests per minute. Pro tier: 600 requests per minute. Enterprise: custom." The numbers are part of the product. Callers design their integrations around the limits, expect the limits to apply consistently, and treat exceeding the limit as a billing or planning question rather than a failure mode.

The implementation is per-caller and per-window. Each API key has a counter, the counter increments on each request, and when the counter exceeds the cap for the window, subsequent requests get a 429 Too Many Requests response with a Retry-After header. The window can be sliding (more accurate, more state) or fixed (simpler, slightly less fair at boundaries). The counter can live in memory (single instance), in a shared store like Redis (multi-instance), or in the database (durable but slower).
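The per-caller, per-window counter described above can be sketched as follows. This is a minimal in-memory sliding-window limiter for a single instance, not a description of any production implementation; the class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-key sliding-window limiter (hypothetical, in-memory, single instance)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # api_key -> timestamps of accepted requests, oldest first
        self._hits = defaultdict(deque)

    def check(self, api_key):
        """Return (allowed, retry_after_seconds) for one incoming request."""
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0
        # The next slot frees up when the oldest accepted request expires.
        retry_after = math.ceil(hits[0] + self.window - now)
        return False, max(retry_after, 1)
```

A rejected call maps to a 429 response whose Retry-After header carries the returned number of seconds. Storing one timestamp per accepted request is the "more state" cost of the sliding window; a fixed window needs only one counter and one reset time per key.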

The discipline of good rate limiting is that the limit is predictable, that the headers communicate the current state (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset), and that the cap is defined by tier rather than varying from account to account, so that customers can plan around it. The limit is also enforced before any work is done, so that a rejected request costs the API almost nothing.
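As a rough illustration of the headers named above, the helper below builds them from a counter's state. The X-RateLimit-* names are a widespread convention rather than a formal standard, and APIs differ on whether the reset value is an epoch timestamp or a number of seconds; this sketch assumes epoch seconds. The function name and parameters are hypothetical.

```python
def rate_limit_headers(limit, used, reset_epoch, now_epoch, rejected=False):
    """Build rate-limit headers for a response (hypothetical helper).

    Assumes X-RateLimit-Reset is expressed in epoch seconds.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(limit - used, 0)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if rejected:
        # On a 429, also tell the caller how many seconds to wait.
        headers["Retry-After"] = str(max(int(reset_epoch - now_epoch), 1))
    return headers
```

For example, `rate_limit_headers(60, 12, 1700000060, 1700000000)` reports 48 requests remaining in the current window.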

Rate limiting is a contractual mechanism. It's the API saying "I will accept this much from you." It applies even when the API has plenty of capacity, because the contract isn't about capacity, it's about fairness and tier differentiation.

Throttling: the safety valve

Throttling is what the API does to protect itself when something goes wrong. Throttling is not in the documentation, or if it is, it's mentioned as an emergency measure, not a normal part of the experience. A caller who hits a throttle is being told that the API is sick, not that the caller is exceeding their limit.

The implementation is request-shaped, not caller-shaped. Throttling watches some signal of system stress — average response time, queue depth, error rate, downstream dependency latency, CPU utilization, memory pressure — and reduces the rate at which new work is accepted when the signal crosses a threshold. The reduction can be fail-fast (return 503 immediately for some fraction of requests), backpressure (slow down responses to slow down callers), shedding (drop the lowest-priority requests), or degradation (skip optional work).

The discipline of good throttling is that it activates before the system is dead, that it activates in proportion to the stress signal rather than as an all-or-nothing switch, and that it deactivates as soon as the stress passes. The throttle is the system saying "I'm under load, please come back later," not "you are exceeding your contract."
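One way to make a throttle proportional instead of all-or-nothing is to turn the stress signal into a shedding probability. The sketch below uses a linear ramp between two thresholds; the names `start` and `full` and the choice of a linear ramp are assumptions for illustration, and the signal could be any of the ones listed earlier (latency, queue depth, error rate).

```python
import random


def shed_probability(signal, start, full):
    """Fraction of requests to reject: 0 at or below `start`, 1 at or above `full`."""
    if signal <= start:
        return 0.0
    if signal >= full:
        return 1.0
    return (signal - start) / (full - start)


def should_shed(signal, start, full, rng=random.random):
    """Decide for one request whether to fail fast with a 503."""
    return rng() < shed_probability(signal, start, full)
```

With `start=200` and `full=1000` (say, milliseconds of latency), a signal of 600 sheds about half of incoming requests, and the fraction falls back to zero on its own as the signal recovers.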

Throttling is an operational mechanism. It applies even when individual callers are well within their contracts, because the system as a whole has more callers than capacity at this moment.

Where they get confused

The two patterns get confused in three common ways, and each produces a recognizable failure mode.

The first confusion is implementing only rate limiting and treating it as the safety valve. Under overload, this API keeps accepting every request that is within its caller's contractual limit and then collapses under the combined load. The contract said the customer could send 600 requests per minute. The customer is sending 600 requests per minute. The API is failing because the database is overloaded, not because anyone is misbehaving. The fix is to add throttling on top of rate limiting, so that contractual limits are honored only when the API has the capacity to honor them.

The second confusion is implementing only throttling and treating it as the contract. This API has no defined limits in its documentation, and customers discover the actual limits empirically by getting 503s. The customer can't plan their integration because the limit changes based on what other customers are doing, and they can't tell whether they're being shaped by a rate limit or by overall system load. The fix is to add rate limiting on top of throttling, so that customers have predictable limits and the throttling only kicks in for stress events that the rate limits couldn't have prevented.

The third confusion is using the same response code for both. A 429 means "you exceeded your rate limit"; a 503 means "the service is unavailable." Mixing them produces caller code that retries the wrong way for the wrong cause: backing off for hours when the throttle was about to clear in seconds, or hammering the API immediately when the rate limit window hadn't reset. The fix is to use distinct status codes (429 for rate limits, 503 for throttling) and distinct headers (Retry-After for both, but with different semantics), and to document the difference clearly so that integrators can build correct retry logic.
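On the caller's side, the distinction might translate into retry logic along these lines. This sketch assumes one particular reading of the "different semantics": on a 429 the Retry-After value is when the window frees up, so waiting less is pointless, while on a 503 it is an estimate, so the caller treats it as a floor and adds jittered exponential backoff. The function name and defaults are hypothetical, and only the delta-seconds form of Retry-After is handled, not the HTTP-date form.

```python
import random


def retry_delay(status, headers, attempt, base=1.0, cap=60.0, rng=random.random):
    """Return seconds to wait before retrying, or None if the status is not retryable."""
    raw = headers.get("Retry-After")
    hinted = float(raw) if raw and raw.isdigit() else None
    if status == 429:
        # Contract: the window will not reset sooner, so wait out the full hint.
        return hinted if hinted is not None else cap
    if status == 503:
        # Stress: use the hint as a floor and add jitter so callers do not return in lockstep.
        backoff = min(cap, base * 2 ** attempt) * rng()
        return (hinted or 0.0) + backoff
    return None
```

The jitter matters only for the 503 case: every caller that was shed at the same moment received the same Retry-After, and without jitter they would all come back together.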

The minimum viable combination

The pattern that holds up at our scale is to have both. Rate limiting enforces the contract: each tier has documented limits, the limits are checked first, and exceeding a limit produces a 429 with appropriate headers. Throttling protects the system: when the system shows signs of stress, additional shaping is applied to all callers regardless of tier, and the response is 503 with Retry-After.
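The ordering described above (contract first, then system health) can be sketched as a single admission step. The two checks are passed in as plain callables so the example stays self-contained; their names, their return shapes, and the fixed 30-second hint are assumptions for illustration.

```python
def admission_decision(api_key, rate_limit_check, throttle_check, throttle_retry_after=30):
    """Return (status, headers); status 200 means the request may proceed.

    rate_limit_check(api_key) -> (allowed, retry_after_seconds)   # hypothetical shape
    throttle_check() -> True if the system is currently shedding  # hypothetical shape
    """
    allowed, retry_after = rate_limit_check(api_key)
    if not allowed:
        # The caller exceeded its contract: 429, whatever the system load is.
        return 429, {"Retry-After": str(retry_after)}
    if throttle_check():
        # The caller is within its contract but the system is stressed: 503.
        return 503, {"Retry-After": str(throttle_retry_after)}
    return 200, {}
```

Checking the rate limit first means a caller who is over its limit always sees a 429, even during a stress event, so the status code keeps telling the caller which mechanism rejected the request.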

The rate limiting is per-API-key and per-window, with a sliding window of 60 seconds for the per-minute limits and a sliding window of 86400 seconds for the per-day limits. The counters live in the application database for our scale; they would live in Redis at higher scale. The headers are emitted on every response, including successful ones, so that callers can track their consumption without an extra request.

The throttling is per-endpoint and per-stress-signal. Endpoints with expensive backends — the PDF generation endpoint in DocuMint, the request capture and replay endpoints in WebhookVault — have throttles that watch the median response time. When the median exceeds a threshold, the throttle starts shedding new requests with 503 and a 30-second Retry-After. When the median returns to normal, the throttle clears.
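A median-based throttle of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the studio's implementation: the class name, the 30-second sample window, and the injectable clock are hypothetical. Samples expire by age, because a throttle that sheds every request records no new latencies and would otherwise never clear.

```python
import statistics
import time
from collections import deque


class MedianLatencyThrottle:
    """Shed requests while the recent median latency exceeds a threshold (hypothetical)."""

    def __init__(self, threshold_ms, window_seconds=30.0,
                 retry_after_seconds=30, clock=time.monotonic):
        self.threshold_ms = threshold_ms
        self.window = window_seconds
        self.retry_after_seconds = retry_after_seconds
        self.clock = clock
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, latency_ms):
        """Call after each completed request on the protected endpoint."""
        self._samples.append((self.clock(), latency_ms))

    def check(self):
        """Return (status, headers): 503 while the recent median is too high, else 200."""
        cutoff = self.clock() - self.window
        # Expire old samples so the throttle can clear once the stress passes.
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        if self._samples:
            median = statistics.median(ms for _, ms in self._samples)
            if median > self.threshold_ms:
                return 503, {"Retry-After": str(self.retry_after_seconds)}
        return 200, {}
```

One instance per expensive endpoint keeps a slow PDF render from throttling unrelated, cheap endpoints.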

The deeper observation

The two patterns answer different questions. Rate limiting answers "what is this customer entitled to?" Throttling answers "what can the system handle right now?" Confusing them produces APIs that are either contractually unfair (a customer can be denied service for reasons unrelated to their behavior) or operationally fragile (the API collapses under load that's technically within its limits). Keeping them distinct, in implementation and in response codes, costs almost nothing and produces an API that customers can integrate against confidently and that operators can run safely. Teams that take this distinction seriously can usually trace a customer-facing reliability incident to a single clear cause. Teams that don't end up with incidents that no one can categorize, because the system was rate-limiting and throttling at the same time and no one can tell which one mattered.
