Distributed Rate Limiting: Token Bucket, Sliding Window, and the Algorithms That Scale
Rate limiting from a single application instance is a solved problem; rate limiting across a fleet of instances with consistent enforcement is harder than it looks. The algorithm choice matters less than the coordination strategy, and the patterns that work treat consistency as a tunable knob r
Rate limiting in a single-process application is a textbook problem: maintain a counter per caller, decrement it on each request, reject when it hits zero, refill over time. There are four well-known algorithms (fixed window, sliding window, token bucket, leaky bucket), each with slightly different burst-vs-smoothness tradeoffs, and any of them works fine when the state lives in a single process's memory.
Rate limiting across a fleet of instances is a different problem. The counter state now needs to be shared across processes, and every implementation choice has a cost: round-trips to a shared store, eventual-consistency windows where the count is wrong, hot keys where many requests touch the same counter, failure modes where the store becomes unreachable. We have implemented rate limiting at each of these scales on DocuMint, CronPing, FlagBit, and WebhookVault, and the patterns that work consistently treat consistency as a tunable knob rather than a goal.
The four core algorithms
The four algorithms are different shapes of the same underlying mechanism. Fixed window counts requests in discrete time buckets (e.g., requests in the current minute), rejects when the count exceeds a threshold, resets at the boundary. The trap is the boundary effect: a caller can fire 2X the limit by firing X requests at the end of one window and X at the beginning of the next.
Sliding window approximates a continuously-moving boundary by interpolating between the current and previous fixed windows. If 80% of the current minute has elapsed, the effective count is 20% of the previous minute plus 100% of the current minute. The boundary effect is smoothed away at the cost of needing two counters per caller.
Token bucket maintains a virtual bucket that fills at a constant rate up to a maximum capacity. Each request consumes one token; requests are rejected when the bucket is empty. This naturally allows bursts up to the bucket capacity while enforcing a long-run rate equal to the refill rate. It is the right algorithm for almost every API use case because it matches how customers actually use APIs (mostly steady with occasional bursts).
Leaky bucket is the dual: requests fill a bucket and leak out at a constant rate. If the bucket overflows, requests are rejected. This forces a constant output rate regardless of input rate, which is useful for downstream systems that have absolute throughput limits but is usually not what you want for an API.
For most APIs the choice is between sliding window and token bucket. Sliding window is easier to explain to customers ("100 requests per minute, smoothed across the window"). Token bucket is easier to reason about as an engineer ("refill rate, burst capacity"). The math works out roughly equivalent at the long-run rate; the difference is whether bursts are explicit (token bucket) or implicit (sliding window).
The shared-state problem
The interesting engineering problem is sharing the counter state across instances. Three patterns dominate.
The first is centralized counter in a shared store. Redis is the canonical choice because it has atomic INCR, EXPIRE, and Lua scripting for compound operations like check-and-decrement. Each instance does a Redis round-trip on every request to consult and update the counter. This is the most accurate approach (the counter is always consistent) and the most expensive (every request pays a round-trip latency).
The latency of the Redis round-trip is usually one to three milliseconds in the same datacenter, which is a substantial fraction of the latency budget for a fast API. The throughput is bounded by Redis's single-threaded model, which caps out at low hundreds of thousands of operations per second per instance. At larger scale you shard the rate-limit keys across multiple Redis instances, which works because rate-limit operations are naturally partitioned by caller.
The second pattern is in-memory counters with periodic synchronization. Each instance maintains a local counter, increments it on local requests, and periodically (every second, or every N requests) pushes updates to a shared store and reads back the consolidated count. This is much faster per request (in-memory operations only) at the cost of consistency: the global count is up to one synchronization window stale, so a caller can exceed the limit by up to (instance count × per-instance burst) before all instances notice.
The third pattern is consistent-hash routing where each caller is routed to the same instance for every request. Each instance owns the counter state for its assigned callers. This is the fastest and most consistent of the three (single-instance counter math, no shared state) but requires that the routing layer be aware of the rate-limit identity, which is not always the case. It also fragile under instance churn: if instances are added or removed, callers get reassigned and the counters get redistributed in ways that can produce brief windows of misenforcement.
The hot-key problem
Even with a shared store, individual rate-limit keys can become hot. A single tenant who is being aggressive (or being attacked) produces all of their requests against a single counter, which becomes the bottleneck. Redis can handle tens of thousands of operations per second on a single key, but high-traffic abusers can produce more than that on a single counter.
The mitigation is local-then-global: each instance does a local check first, and only consults the shared store if the local check passes. The local check uses a small fraction of the global budget (say, 1/N of the budget where N is the instance count, or a fixed small value), and the shared store is only consulted for requests that would not be rejected by the local check anyway. This collapses the per-key load on the shared store by approximately the instance count.
The downside is that a caller can briefly exceed the global limit by up to (instance count × local-check limit) before the shared store catches up. For most APIs this is acceptable: the goal of rate limiting is to prevent abuse and protect downstream resources, not to enforce a precise count.
The failure mode story
The shared store will sometimes be unreachable. The question is what the rate limiter does in that case. Two answers, neither obvious.
Fail-open: if the rate-limit store is unreachable, allow the request. This protects availability at the cost of letting abusers through during outages. For most SaaS APIs this is the right default, because rate-limit failures during a backing-store outage are not the kind of failure that customers can do anything about, and rejecting requests adds insult to injury.
Fail-closed: if the rate-limit store is unreachable, reject the request. This protects downstream resources at the cost of customer-visible failures during outages. It is the right default only for APIs whose downstream resources cannot tolerate any unbounded traffic — payment processing, regulatory-bound operations, expensive third-party API calls.
The hybrid is to use a circuit-breaker pattern: fail-open for the first N seconds of a store outage (assume the outage is transient), then fail-closed if it persists. This buys the availability benefit during normal blips while preventing extended abuse during long outages.
The metric tracking question
Rate limiting only works if you can observe it. Three signals to track.
First, the 429 rate per endpoint and per caller. A rising 429 rate against a specific endpoint usually means either a customer integration is misconfigured or the endpoint's limit is too low. A rising 429 rate against a specific caller usually means abuse or a runaway customer process.
Second, the latency of the rate-limit check itself. Rate limiting adds latency to every request; if the check itself is slow, the rate limiter is the cause of API slowness rather than a tool against abuse. We typically alert when the rate-limit check exceeds 10ms at p99, which is well above the normal latency.
Third, the burst-vs-steady ratio per caller. Callers whose traffic is mostly bursts are different from callers whose traffic is steady; the same rate limit affects them very differently. This signal is useful for tuning limits and for diagnosing customer complaints — a customer who is hitting the limit because their traffic is bursty is a different problem than one who is hitting the limit because their steady-state load is too high.
The deeper observation
Rate limiting is one of those features where the simple version is easy and the production version is genuinely difficult. The textbook algorithms are not the hard part; the hard part is the coordination across instances, the failure modes when the coordination layer is unreachable, the hot-key problem with abusive callers, and the trade-offs between consistency and latency that make the easy answer in a textbook the wrong answer in production. The teams that have working rate limiting have it because they have made these trade-offs explicitly rather than discovering them under load.