Engineering

Rate Limiting Patterns That Won't Annoy Your Users

Anethoth

24 Apr 2026 — 3 min read

Rate limiting is one of those features where the implementation difference between "protects the API" and "annoys legitimate users" is subtle but critical. Get it wrong and your biggest customers hit walls during normal usage. Get it right and abusers are blocked without legitimate users ever noticing.

The Three Common Algorithms

Fixed Window

The simplest approach: count requests in a fixed time window (e.g., 100 requests per minute, resetting every minute on the minute). Easy to implement but has an edge case: a user can send 100 requests at 11:59:59 and 100 more at 12:00:01, effectively doubling their rate at window boundaries.
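The boundary problem is easier to see in code. The following is a minimal in-memory sketch, not a production limiter: the class name, the `allow` method, and the injected `now` timestamp are illustrative assumptions, and old windows are never cleaned up.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per key inside fixed, clock-aligned windows."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Maps (key, window index) -> request count.
        # A real service would expire old windows; this sketch does not.
        self.counts = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        # Every timestamp in the same window maps to the same index.
        window = int(now // self.window_seconds)
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


def boundary_burst_demo() -> int:
    """Counts requests accepted within two seconds around a window boundary."""
    limiter = FixedWindowLimiter(limit=100, window_seconds=60)
    accepted = sum(limiter.allow("customer-1", now=59.0) for _ in range(100))
    accepted += sum(limiter.allow("customer-1", now=61.0) for _ in range(100))
    return accepted
```

With a limit of 100 per minute, `boundary_burst_demo()` accepts all 200 requests because the two bursts land in different windows.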

Sliding Window

Instead of resetting at fixed intervals, the window slides with each request. If the limit is 100/minute, the system checks how many requests occurred in the last 60 seconds from right now. This eliminates the boundary problem but requires storing individual request timestamps.
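One way to implement this is a per-key log of timestamps. The sketch below assumes an in-memory store and an injected `now` value; the class and method names are illustrative, not taken from any particular library.

```python
from collections import deque


class SlidingWindowLimiter:
    """Keeps a log of accepted request timestamps for each key."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # key -> deque of accepted timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        log = self.logs.setdefault(key, deque())
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is one timestamp per accepted request per key, which is the storage trade-off described above.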

Token Bucket

Imagine a bucket that fills with tokens at a steady rate. Each request costs one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing short bursts while enforcing an average rate. This is the most user-friendly algorithm because it accommodates natural usage patterns: brief bursts of activity followed by quiet periods.
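A token bucket needs only two numbers per client: the current token count and the time of the last refill. This is a minimal sketch under those assumptions; `capacity`, `refill_rate`, and the injected `now` parameter are names chosen for illustration.

```python
class TokenBucket:
    """Refills continuously at refill_rate tokens per second, up to capacity."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full so a new client can burst
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

Capacity controls how large a burst is tolerated, and the refill rate sets the long-run average. The `cost` parameter is an optional extension that lets expensive requests spend more than one token.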

Many API products, including DocuMint, CronPing, FlagBit, and WebhookVault, use a variant of token bucket because it matches how developers actually use APIs.

Response Headers Matter

The difference between helpful rate limiting and hostile rate limiting is transparency. Include the X-RateLimit headers in every response, and add Retry-After to every 429 response:

X-RateLimit-Limit: 100        # Total allowed per window
X-RateLimit-Remaining: 73     # Remaining in current window
X-RateLimit-Reset: 1714003200 # Unix timestamp when window resets
Retry-After: 30               # Seconds to wait (on 429 responses)

These headers let clients self-regulate. A well-written integration checks X-RateLimit-Remaining and slows down before hitting the limit. Without these headers, clients can only discover limits by being rejected.
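On the client side, self-regulation can be as small as one function that decides how long to pause before the next call. The sketch below is an assumption-laden illustration: `seconds_to_wait` and `low_water_mark` are made-up names, headers are read from a plain dict with exactly the casing shown above, and Retry-After is assumed to be in its seconds form rather than an HTTP date.

```python
def seconds_to_wait(status_code: int, headers: dict, now: float,
                    low_water_mark: int = 5) -> float:
    """Returns how long a polite client should pause before its next request."""
    # On a rejection, trust the server's explicit instruction.
    if status_code == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])

    # If the header is missing, assume there is plenty of budget left.
    remaining = int(headers.get("X-RateLimit-Remaining", low_water_mark + 1))
    if remaining > low_water_mark:
        return 0.0

    # Close to the limit: spread the remaining calls over the time left.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    seconds_left = max(0.0, reset_at - now)
    return seconds_left / (remaining + 1)
```

A caller would sleep for the returned number of seconds between requests, so traffic slows down as the remaining budget shrinks instead of stopping abruptly at a 429.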

Per-Key vs. Per-IP vs. Per-Endpoint

Rate limiting by API key is the most common approach for authenticated APIs. Each customer gets their own allocation. But consider layering limits:

Per API key: protects against individual customers overwhelming the system
Per IP: protects against unauthenticated abuse (brute-force attacks on login endpoints)
Per endpoint: protects expensive operations (PDF generation costs more than listing invoices)
Global: protects the entire system from cascading failures

A single rate limit strategy is almost never sufficient. The right approach is layered: generous per-key limits for normal operation, tighter per-IP limits for unauthenticated endpoints, and per-endpoint limits for computationally expensive operations.
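The layering can be sketched as a single check that consults every applicable limit before counting the request. Everything here is illustrative: the class name, the fixed-window counters, and the choice to apply the per-IP limit only to requests without an API key are assumptions, not a prescribed design.

```python
from collections import defaultdict
from typing import Optional


class LayeredLimiter:
    """Applies global, per-key or per-IP, and per-endpoint limits together."""

    def __init__(self, per_key: int, per_ip: int, per_endpoint: dict,
                 global_limit: int, window_seconds: int = 60) -> None:
        self.per_key = per_key
        self.per_ip = per_ip
        self.per_endpoint = per_endpoint  # endpoint path -> limit per caller
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def check(self, api_key: Optional[str], ip: str, endpoint: str,
              now: float) -> Optional[str]:
        """Returns the name of the first exceeded layer, or None if allowed."""
        window = int(now // self.window_seconds)
        caller = api_key or ip
        layers = [("global", "*", self.global_limit)]
        if api_key:
            layers.append(("key", api_key, self.per_key))
        else:
            layers.append(("ip", ip, self.per_ip))
        if endpoint in self.per_endpoint:
            layers.append(("endpoint", f"{caller}:{endpoint}",
                           self.per_endpoint[endpoint]))

        # Check every layer first so a rejected request consumes no quota.
        for name, scope, limit in layers:
            if self.counts[(name, scope, window)] >= limit:
                return name
        for name, scope, _ in layers:
            self.counts[(name, scope, window)] += 1
        return None
```

Returning the name of the layer that rejected the request makes it possible to tell the client which limit it hit.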

Graceful Degradation

The harshest response to a rate limit is a flat 429 Too Many Requests. Better alternatives:

Queuing: Accept the request and process it later, returning a 202 Accepted with a status URL. The user is not rejected; they just wait longer.

Reduced functionality: Return cached results instead of fresh data. The response is stale but better than nothing.

Progressive limits: First breach gets a warning header. Second breach slows responses (add deliberate latency). Third breach returns 429. This gives legitimate users a chance to correct their behavior before being cut off.
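The progressive approach maps naturally onto a small decision function keyed on how many times a client has breached its limit recently. This is only a sketch: the `Decision` type, the one-second delay, and the `X-RateLimit-Warning` header are hypothetical choices made for illustration, not an established convention.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    status: int
    delay_seconds: float = 0.0
    headers: dict = field(default_factory=dict)


def progressive_decision(breaches: int, retry_after: int) -> Decision:
    """Escalates from a warning, to added latency, to a 429 rejection."""
    if breaches <= 0:
        return Decision(status=200)
    # Hypothetical warning header; pick whatever name your docs describe.
    warning = {"X-RateLimit-Warning": "limit exceeded; slow down"}
    if breaches == 1:
        return Decision(status=200, headers=warning)
    if breaches == 2:
        # Serve the request, but only after a deliberate pause.
        return Decision(status=200, delay_seconds=1.0, headers=warning)
    return Decision(status=429, headers={"Retry-After": str(retry_after)})
```

The caller is responsible for tracking the breach count per client and for actually sleeping for `delay_seconds` before responding.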

The Human Factor

Rate limits feel personal. When a developer hits a 429, their first reaction is not "I should optimize my request pattern." It is "this API is broken" or "this API is hostile." Clear error messages help:

{
  "error": "rate_limit_exceeded",
  "message": "You've exceeded 100 requests/minute. Your limit resets in 34 seconds.",
  "docs_url": "https://docs.example.com/rate-limits",
  "upgrade_url": "https://example.com/pricing"
}

Notice the upgrade URL. Rate limits are a natural upsell opportunity. A user hitting limits is, by definition, an engaged user. Make it easy for them to pay for more.
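A small helper keeps the error body and the Retry-After header in agreement, so the message never promises a different wait than the header. The function name and parameters below are assumptions for illustration; the field names mirror the example payload above.

```python
import json


def rate_limit_error(limit: int, window_label: str, retry_after_seconds: int,
                     docs_url: str, upgrade_url: str):
    """Builds the status, headers, and JSON body for a 429 response."""
    body = {
        "error": "rate_limit_exceeded",
        "message": (
            f"You've exceeded {limit} requests/{window_label}. "
            f"Your limit resets in {retry_after_seconds} seconds."
        ),
        "docs_url": docs_url,
        "upgrade_url": upgrade_url,
    }
    headers = {
        "Content-Type": "application/json",
        # Same number as in the message, so clients and humans agree.
        "Retry-After": str(retry_after_seconds),
    }
    return 429, headers, json.dumps(body)
```

Called with `limit=100`, `window_label="minute"`, and `retry_after_seconds=34`, it produces the message shown in the example.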

Implementation Tips

Use Redis or an in-memory counter for rate limit state. Do not query your primary database on every rate limit check: the added latency and load defeat the purpose. For small-scale APIs with a single instance, Python's slowapi (for FastAPI and Starlette) or Express's express-rate-limit work fine with in-memory storage. For distributed systems, a Redis counter with atomic increment and a TTL on the key is the standard approach.
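The increment-plus-TTL pattern can be sketched as a fixed-window counter keyed by client and window. The function below accepts any client object exposing `incr` and `expire` methods, which the redis-py client provides; the key format, the function name, and the choice of a TTL of two windows are assumptions for illustration.

```python
def allow_request(client, key: str, limit: int, window_seconds: int,
                  now: float) -> bool:
    """Fixed-window check backed by an atomic counter with a TTL."""
    window = int(now // window_seconds)
    # Hypothetical key layout: one counter per client per window.
    redis_key = f"ratelimit:{key}:{window}"

    # INCR is atomic, so concurrent app instances never lose an update.
    count = client.incr(redis_key)
    if count == 1:
        # First hit in this window: let the key expire on its own later.
        client.expire(redis_key, window_seconds * 2)
    return count <= limit
```

The increment and the expiry are two separate commands here. If the process dies between them the key never expires, but because the window index is part of the key, the worst case is a leaked counter rather than a client locked out forever. Wrapping both commands in a transaction or a server-side script would close that gap.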

Test your rate limits under load before deploying. Theoretical limits and practical limits diverge when network latency, concurrent requests, and clock skew enter the picture. What works in unit tests may fail in production.

Rate limiting is not about saying no. It is about saying "not right now, but here's when and how." The best rate limiters are invisible to well-behaved users and helpful to everyone else.
