Rate Limiting in FastAPI: Patterns That Survive Real Traffic
The rate limiter in your starter template will not survive a real traffic spike. Here is what actually works in production, and the small set of decisions that matter more than the algorithm.
The most common rate-limiting code in FastAPI tutorials is a five-line dependency that increments a Python dict and rejects after N requests. It works on a developer machine and falls over the moment you put a second worker behind a reverse proxy. The patterns that hold up to real traffic are not exotic, but they involve a small number of decisions that the tutorials almost never make explicit.
Decide what you are protecting before you write any code
Rate limits are not one feature; they are a policy that protects different resources differently. A signup endpoint protects your database from synthetic accounts. A search endpoint protects your downstream cache from a thundering herd. A password reset endpoint protects users from account takeover. The same numeric limit on all three is wrong on at least two of them.
Before you choose a library, write down for each public endpoint: who is the limit protecting, what is the worst legitimate burst, and what should the failure mode look like. This document will be five pages long if you have a real API, and it will be the most useful document about your service.
Identify the right key, not the IP
The single most common rate-limiting bug is keying on remote IP without thinking about what that IP represents. If your service runs behind a load balancer, the remote IP is the load balancer; you need to read X-Forwarded-For and trust only the entries appended by your own infrastructure. Those sit at the right-hand end of the header; the leftmost address is whatever the client sent and is trivial to spoof. If users come from a corporate NAT, an entire office building shares an IP and your per-IP limit will throttle a legitimate team. If users come from a mobile carrier, a million phones may share a single CGNAT address.
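A minimal sketch of picking the client address out of X-Forwarded-For, assuming every proxy you control appends the address it saw to the right-hand end of the header. The TRUSTED_PROXIES networks and the function names are hypothetical placeholders; substitute the ranges your own load balancers use.

```python
from ipaddress import ip_address, ip_network
from typing import Optional

# Hypothetical: the networks your own load balancers and proxies live in.
TRUSTED_PROXIES = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]


def is_trusted(addr: str) -> bool:
    """True if addr parses as an IP inside one of the trusted proxy networks."""
    try:
        ip = ip_address(addr)
    except ValueError:
        return False
    return any(ip in net for net in TRUSTED_PROXIES)


def client_ip(peer_ip: str, x_forwarded_for: Optional[str]) -> str:
    """Return the address nearest to the server that is not one of our proxies."""
    # If the direct peer is not our proxy, the header is client-controlled: ignore it.
    if not x_forwarded_for or not is_trusted(peer_ip):
        return peer_ip
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Walk from the right: the rightmost entries were appended by our own proxies.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    # Every hop was one of ours (internal traffic); fall back to the peer.
    return peer_ip
```

Anything to the left of the returned address was supplied by the client and should be treated as untrusted input, useful for logging at most.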
For authenticated endpoints, key on the API key or user ID. For unauthenticated endpoints, key on a tuple: IP plus a low-entropy fingerprint (User-Agent class, accepted languages) so that a single shared IP can support multiple legitimate clients without breaking the limit. The naive single-IP key is responsible for most of the false positives in real rate-limiting incidents.
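One way the key selection could look, as a sketch rather than a definitive scheme. The function names, the User-Agent buckets, and the key prefixes are all assumptions made for illustration; the only idea taken from the text is "credential when authenticated, IP plus a coarse fingerprint otherwise".

```python
import hashlib
from typing import Optional


def user_agent_class(user_agent: str) -> str:
    """Collapse a User-Agent string into a coarse, low-entropy bucket."""
    ua = user_agent.lower()
    # Order matters: mobile browsers also contain "mozilla".
    for marker, label in (("curl", "cli"), ("python", "script"),
                          ("mobile", "mobile"), ("mozilla", "browser")):
        if marker in ua:
            return label
    return "other"


def rate_limit_key(api_key: Optional[str], ip: str,
                   user_agent: str = "", accept_language: str = "") -> str:
    """Key authenticated callers by credential, anonymous ones by IP plus fingerprint."""
    if api_key:
        # Hash so that raw credentials never end up in the limiter's store.
        return "key:" + hashlib.sha256(api_key.encode()).hexdigest()[:16]
    # Keep only the primary language tag, e.g. "en-US,en;q=0.9" -> "en".
    first = accept_language.split(",")[0].split("-")[0].split(";")[0]
    lang = first.strip().lower() or "none"
    return f"anon:{ip}:{user_agent_class(user_agent)}:{lang}"
```

The fingerprint is deliberately coarse: a handful of buckets is enough to separate a phone from a script on the same shared address, without creating a new key for every minor browser version.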
The right algorithm is almost always a sliding window
Token bucket, leaky bucket, fixed window, sliding window log, sliding window counter — the textbooks describe five algorithms; in practice you want sliding window counter for almost every API. Fixed window has a boundary effect where a client can fire 2N requests in two seconds by hitting the end of one window and the start of the next. Token bucket is great for ingress traffic shaping and overkill for an HTTP API. Sliding window counter approximates a true sliding window with two integers, runs in O(1), and behaves predictably under burst traffic.
Implementation is two Redis commands or two SQL statements: an increment of the current window's counter and a read of the previous window's counter, with the previous count weighted by how much of that window still overlaps the trailing window. Six lines of code. The arithmetic is in the algorithm, not the wiring.
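The arithmetic on its own, as a sketch with the storage left out. The function names are illustrative, and curr_count is assumed to be the count before the request being checked.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            now: float, window: int) -> float:
    """Estimate requests in the trailing `window` seconds from two fixed-window counters."""
    elapsed_fraction = (now % window) / window  # how far into the current window we are
    # The previous window still overlaps the trailing window by (1 - elapsed_fraction).
    return prev_count * (1.0 - elapsed_fraction) + curr_count


def allow(prev_count: int, curr_count: int, now: float,
          window: int, limit: int) -> bool:
    """True if one more request still fits under `limit`."""
    return sliding_window_estimate(prev_count, curr_count, now, window) + 1 <= limit
```

For example, 15 seconds into a 60-second window, with 80 requests counted in the previous window and 30 so far in this one, the estimate is 80 x 0.75 + 30 = 90, so a limit of 100 still admits the request.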
Use a backing store you can share across workers
The in-process dict approach fails the moment you run multiple Uvicorn workers, because each worker has its own dict. Redis is the default; SQLite works for small services if you accept the contention; PostgreSQL works if you already have a connection. The point is that the limiter must see all traffic, not the traffic that happened to land on its own worker.
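A sketch of a sliding window counter against a shared store. The hit function only assumes a client object with incr, get, and expire methods, which are the names the redis-py client uses; DictStore is a hypothetical in-memory stand-in so the example runs without a server, and the rl: key prefix is likewise an assumption.

```python
import time
from typing import Optional


class DictStore:
    """In-memory stand-in exposing the three redis-py-style calls used below.

    Only suitable for tests or a single process; TTLs are ignored.
    """

    def __init__(self) -> None:
        self.data: dict = {}

    def incr(self, name: str) -> int:
        self.data[name] = int(self.data.get(name, 0)) + 1
        return self.data[name]

    def get(self, name: str):
        return self.data.get(name)

    def expire(self, name: str, seconds: int) -> None:
        pass  # a real store would drop the key after `seconds`


def hit(store, key: str, limit: int, window: int,
        now: Optional[float] = None) -> bool:
    """Count one request for `key` in the shared store; return True if it is allowed."""
    now = time.time() if now is None else now
    bucket = int(now // window)
    curr_name = f"rl:{key}:{bucket}"
    prev_name = f"rl:{key}:{bucket - 1}"

    curr = store.incr(curr_name)  # a single atomic command when the store is Redis
    if curr == 1:
        store.expire(curr_name, window * 2)  # keep it long enough to serve as "previous"
    prev = int(store.get(prev_name) or 0)

    weight = 1.0 - (now % window) / window  # share of the previous window still in view
    return prev * weight + curr <= limit
```

Because every worker calls hit against the same store, the counters reflect all traffic. Note one design choice in this sketch: rejected requests still increment the counter, so a client that keeps hammering stays limited for longer.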
If you cannot use Redis, the next best option is a small dedicated sidecar that proxies the rate-limit check; if you cannot afford that either, accept that your N-worker service is really N independent rate limiters and pick a per-worker limit accordingly. The trap is using an in-process dict and assuming it works because it doesn't crash. With four workers it can let through up to 4x your intended limit, because each worker enforces the full limit against only the share of traffic it sees.
Headers that tell the truth
Send X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response, not just on the 429. Honest headers let well-behaved clients self-throttle and reduce the load on your limiter. The names are not standardized; pick the GitHub names because they are the most widely recognized, and document them.
On a 429, also send Retry-After with seconds (not a date; clients get the parsing wrong). Make the response body a small structured JSON object with the remaining quota, the reset time, and a permanent link to the rate-limit documentation. Do not return HTML.
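A sketch of building the headers and the 429 payload as plain data. The function names, the body field names, and the documentation URL are placeholders chosen for illustration; X-RateLimit-Reset is assumed here to carry a Unix timestamp.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers for every response, using the GitHub-style names."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(reset_epoch),  # Unix time when the window resets
    }


def too_many_requests(limit: int, reset_epoch: int, now: float,
                      docs_url: str = "https://example.com/docs/rate-limits"):
    """Return (status, headers, json_body) for a rejected request."""
    retry_after = max(1, math.ceil(reset_epoch - now))  # whole seconds, never 0
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    body = {
        "error": "rate_limited",
        "limit": limit,
        "remaining": 0,
        "reset": reset_epoch,
        "docs": docs_url,
    }
    return 429, headers, body
```

In a FastAPI handler or dependency, the three values can be handed to JSONResponse(status_code=status, content=body, headers=headers), and rate_limit_headers can be applied to successful responses in the same way.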
Different tiers, different limits, one code path
If you have free and paid tiers, the limits should be configurable per tier, not hardcoded per endpoint. Store the limit in the user record (or the API key record) and read it in the dependency. This means a tier change takes effect on the next request without a deploy, and it means an emergency override (for a legitimate customer hit by a viral mention) is a single UPDATE statement.
Build the limiter once, not per endpoint. A dependency that takes a category name and reads the limit from config beats a copy-paste-per-route approach that drifts within a quarter.
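A sketch of one limit lookup shared by every route. The tier names, categories, numbers, and the shape of the user record (a dict with "tier" and an optional "limit_overrides" mapping) are all hypothetical; in a real service the record would come from your database.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Limit:
    requests: int
    window_seconds: int


# Hypothetical defaults per tier and endpoint category.
TIER_LIMITS: Dict[str, Dict[str, Limit]] = {
    "free": {"search": Limit(30, 60), "signup": Limit(5, 3600)},
    "paid": {"search": Limit(600, 60), "signup": Limit(5, 3600)},
}
DEFAULT_TIER = "free"


def limit_for(category: str, user: dict) -> Limit:
    """Resolve the limit for one request: per-user override first, then tier default."""
    overrides = user.get("limit_overrides") or {}
    if category in overrides:
        return overrides[category]  # the emergency override stored on the record
    tier = user.get("tier", DEFAULT_TIER)
    return TIER_LIMITS.get(tier, TIER_LIMITS[DEFAULT_TIER])[category]


def limiter_for(category: str) -> Callable[[dict], Limit]:
    """Build one checker per category so every route shares the same code path."""
    def check(user: dict) -> Limit:
        return limit_for(category, user)
    return check
```

In FastAPI the closure returned by limiter_for("search") would typically be wrapped in a dependency that first resolves the current user and then runs the actual counter check; that wiring is left out here.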
Test the limiter in CI
The rate limiter is the part of your code where bugs are silent until production. Write a test that fires N+1 requests in a tight loop against a test endpoint, asserts that the (N+1)th gets a 429, asserts that Retry-After is present and parses, and asserts that after the window expires, the next request gets a 200. This test will catch every refactor that breaks the limiter, including the ones that "look right."
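The shape of that test, sketched with a fake clock so CI does not have to sleep. FixedWindowLimiter is a deliberately tiny stand-in written for this example, not the limiter described above; in a real suite the same assertions would go through your app's test client against the real dependency, and the wait before recovery depends on the algorithm in use.

```python
import math


class FakeClock:
    """Controllable time source so the test never sleeps."""

    def __init__(self, start: float = 1_000_000.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class FixedWindowLimiter:
    """Stand-in limiter returning (status, headers) for each request."""

    def __init__(self, limit: int, window: int, clock) -> None:
        self.limit, self.window, self.clock = limit, window, clock
        self.counts: dict = {}

    def request(self, key: str):
        now = self.clock()
        bucket = int(now // self.window)
        count = self.counts.get((key, bucket), 0) + 1
        self.counts[(key, bucket)] = count
        if count > self.limit:
            reset = (bucket + 1) * self.window
            return 429, {"Retry-After": str(max(1, math.ceil(reset - now)))}
        return 200, {}


def test_limiter_rejects_then_recovers() -> None:
    clock = FakeClock()
    limiter = FixedWindowLimiter(limit=5, window=60, clock=clock)

    # The first N requests pass.
    assert [limiter.request("client-a")[0] for _ in range(5)] == [200] * 5

    # Request N+1 is rejected with a parseable Retry-After.
    status, headers = limiter.request("client-a")
    assert status == 429
    assert int(headers["Retry-After"]) > 0

    # After the window expires, traffic flows again.
    clock.advance(60)
    assert limiter.request("client-a")[0] == 200
```

Injecting the clock is the design choice that keeps this fast and deterministic; a limiter that calls time.time() directly forces the test to sleep for a full window.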
Across our four products
Every API we run uses some version of these patterns. DocuMint rate-limits the demo PDF endpoint per IP and the authenticated invoice endpoint per API key. CronPing limits monitor pings by token because the natural unit of rate is "this monitor." FlagBit limits flag evaluations per project because that is what scales with traffic. WebhookVault limits per endpoint token to keep one noisy webhook source from starving the rest. Different keys, same algorithm, same plumbing.
The rate limiter is small, undramatic infrastructure. It is also the thing that decides whether your service has good days or bad days when somebody links to it on Hacker News. Build it once, build it correctly, and ignore it for the next two years.