A Practical Guide to API Rate Limiting
Rate limiting is one of those features that seems simple until you try to build it well. The basic idea — restrict how many requests a client can make — is straightforward. The details are where it gets interesting.
Why Rate Limit
Three reasons, in order of importance:
- Protect your infrastructure. A single client making 10,000 requests per second can take down your API for everyone. Rate limiting prevents one bad actor (or one buggy integration) from becoming an outage.
- Enforce fair usage. Your pricing tiers promise different levels of access. Rate limiting is how you deliver on that promise without manually monitoring every account.
- Signal quality. APIs without rate limits feel amateur. Rate limits, paradoxically, make your API feel more professional — they signal that other people are using it and that you care about uptime.
The Three Algorithms
Token Bucket
Imagine a bucket that holds N tokens. Each request removes one token. Tokens refill at a fixed rate. When the bucket is empty, requests are rejected until tokens refill.
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.refill_rate = refill_rate  # tokens per second
        # Monotonic clock: elapsed time is unaffected by wall-clock adjustments.
        self.last_refill = time.monotonic()

    def allow(self):
        # Credit the tokens earned since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Pro: Allows bursts up to the bucket capacity. A client with a capacity of 100 can make 100 requests instantly, then must wait for the bucket to refill. This matches real-world usage patterns: most API clients send bursts, not steady streams.
Con: Slightly more complex to implement correctly in distributed systems.
Fixed Window
Count requests in fixed time windows (e.g., per minute). Reset the counter at the start of each window.
Pro: Dead simple. One counter per key per window.
Con: Boundary problem. A client can make 100 requests at 11:59:59 and another 100 at 12:00:00 — effectively 200 requests in 2 seconds while staying within a "100 per minute" limit.
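The counter and its boundary problem can be sketched in a few lines. This is a minimal in-memory sketch under stated assumptions, not a production implementation: the class name FixedWindowLimiter and the injectable now parameter are hypothetical, added so the boundary case can be reproduced deterministically.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Hypothetical in-memory fixed-window limiter (single process only)."""

    def __init__(self, limit, window_size=60):
        self.limit = limit
        self.window_size = window_size  # seconds
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key, now=None):
        # `now` is injectable for testing; it defaults to the wall clock.
        now = time.time() if now is None else now
        window = int(now // self.window_size)  # index of the current window
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


# Boundary problem: 100 requests just before a window edge, 100 just after.
limiter = FixedWindowLimiter(limit=100, window_size=60)
late = sum(limiter.allow("client-a", now=119.0) for _ in range(100))
early = sum(limiter.allow("client-a", now=120.0) for _ in range(100))
accepted = late + early  # every request in both bursts is accepted
```

Both bursts are accepted because each falls in a different window, even though they arrive one second apart. Note that this sketch never evicts old windows; a real deployment would need to expire them.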
Sliding Window
Weighted combination of current and previous window counts. Approximates a true sliding window without storing individual request timestamps.
# Weighted count = previous_count * overlap + current_count
# overlap = (window_size - elapsed_in_current) / window_size
# get_count, key, previous_window, current_window and window_size come from your counter store.
previous_count = get_count(key, previous_window)
current_count = get_count(key, current_window)
elapsed = time.time() % window_size  # seconds into the current window
overlap = (window_size - elapsed) / window_size  # share of the previous window still counted
effective_count = previous_count * overlap + current_count

Pro: Smooth. No boundary spikes. Low memory.
Con: Approximation, not exact. Usually close enough.
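To make the weighting concrete, here is one way the formula could be wrapped into a self-contained limiter. Treat it as a sketch under assumptions: the class name SlidingWindowLimiter, the in-memory dictionary, and the injectable now parameter are all hypothetical choices for illustration, and windows are assumed to be aligned to multiples of window_size.

```python
import time
from collections import defaultdict


class SlidingWindowLimiter:
    """Hypothetical in-memory sliding-window counter (single process only)."""

    def __init__(self, limit, window_size=60):
        self.limit = limit
        self.window_size = window_size  # seconds
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key, now=None):
        # `now` is injectable for testing; it defaults to the wall clock.
        now = time.time() if now is None else now
        current = int(now // self.window_size)
        elapsed = now % self.window_size
        # Fraction of the previous window still inside the sliding window.
        overlap = (self.window_size - elapsed) / self.window_size
        previous_count = self.counts.get((key, current - 1), 0)
        current_count = self.counts.get((key, current), 0)
        effective_count = previous_count * overlap + current_count
        if effective_count >= self.limit:
            return False
        self.counts[(key, current)] += 1
        return True


# The burst that slips past a fixed window: 100 requests right before a
# window edge, then 100 right after it, then 100 more half a window later.
limiter = SlidingWindowLimiter(limit=100, window_size=60)
late = sum(limiter.allow("client-a", now=119.0) for _ in range(100))
early = sum(limiter.allow("client-a", now=120.0) for _ in range(100))
halfway = sum(limiter.allow("client-a", now=150.0) for _ in range(100))
```

Right after the window edge the previous window still counts in full, so the second burst is rejected. Half a window later the previous count is weighted by 0.5, which leaves room for 50 more requests.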
What We Use
At Anethoth, all four of our APIs use a simple per-key-per-minute counter stored in memory (they are single-process Python apps backed by SQLite). When we scale to multiple processes, we'll move to Redis-backed sliding windows.
The headers we return on every response:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1713456000

When rate limited, we return 429 Too Many Requests with a Retry-After header. The error message includes the limit, the reset time, and a link to pricing for higher limits.
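A framework-agnostic sketch of how such a response might be assembled is shown below. The header names and the 429 status come from the description above; everything else is an assumption for illustration, including the function names, the JSON field names in the body, and the placeholder pricing URL.

```python
import math
import time


def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers attached to every response (names as listed in the article)."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }


def too_many_requests(limit, reset_epoch, now=None,
                      pricing_url="https://example.com/pricing"):
    """Build a (status, headers, body) tuple for a rate-limited request.

    The body layout and pricing_url are hypothetical placeholders.
    """
    now = time.time() if now is None else now
    # Retry-After as whole seconds until the window resets, at least 1.
    retry_after = max(1, math.ceil(reset_epoch - now))
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    body = {
        "error": "rate_limit_exceeded",
        "message": f"Limit of {limit} requests per minute exceeded.",
        "limit": limit,
        "reset": int(reset_epoch),
        "upgrade_url": pricing_url,
    }
    return 429, headers, body


status, headers, body = too_many_requests(
    60, reset_epoch=1713456000, now=1713455970.5
)
```

Rounding Retry-After up rather than down avoids telling a client to retry a fraction of a second before the window actually resets.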
Common Mistakes
- Rate limiting by IP address alone. Shared IPs (corporate offices, VPNs, cloud providers) mean many users share one limit. Always rate limit by API key when possible, with IP-based limiting as a fallback for unauthenticated endpoints.
- Silent rate limiting. Dropping requests without explanation is hostile. Always return clear 429 responses with headers that tell the client when they can retry.
- Overly aggressive limits. If your free tier allows 10 requests per minute, you will spend more time handling support tickets about rate limiting than you save in server costs. Be generous with limits, strict with enforcement.
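The first point, limiting by API key with an IP fallback, comes down to how the limiter key is chosen. A minimal sketch, assuming the caller has already extracted the API key and client IP from the request (the function name rate_limit_key and the key prefixes are hypothetical):

```python
def rate_limit_key(api_key, client_ip):
    """Pick the counter a request is charged against.

    Authenticated requests are limited per API key; unauthenticated
    requests fall back to the client IP. The prefixes keep the two
    namespaces from colliding.
    """
    if api_key:
        return f"key:{api_key}"
    return f"ip:{client_ip}"


authenticated = rate_limit_key("abc123", "203.0.113.7")
anonymous = rate_limit_key(None, "203.0.113.7")
```

With this split, two users behind the same corporate IP each get their own limit as long as they send their own API keys.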
Rate limiting is infrastructure that your users should almost never think about. If they notice it, your limits are too low or your errors are too opaque. The best rate limiter is invisible.