Cache Stampede: Why Your Cache Layer Eventually Causes an Outage

A cache works perfectly until the moment it does not, and the moment of failure is rarely graceful. When a popular cache entry expires under load, thousands of requests can hit the origin simultaneously. The patterns that prevent the outage.

A cache is a performance optimization that quietly becomes a load-bearing system. Once an application has shipped with a cache in front of an expensive operation, the underlying system is sized for the cache-hit traffic pattern, not the cache-miss pattern. The failure mode shows up the first time a popular cache entry expires under load: thousands of requests find the cache empty within milliseconds of each other and all simultaneously hit the origin. The origin, sized for occasional misses, sees a traffic spike of 100x or more and either falls over or returns slowly enough that the next wave of requests piles up behind the first. This is cache stampede, also called the thundering herd problem.

We have seen this pattern across DocuMint, CronPing, FlagBit, and WebhookVault, where caching expensive computations is common. The patterns to prevent it are simple in principle, easy to get wrong in practice, and rarely covered in introductory caching tutorials.

Why the failure is sudden

The intuition for why cache stampede is sudden rather than gradual is worth dwelling on. Under normal traffic with a warm cache, requests for a popular item return from cache in 1-10 milliseconds. The origin computation might take 500 ms, but it runs only when the cache entry expires, which for an entry with a 5-minute TTL means once every 300 seconds.

The instant the cache entry expires, every concurrent request observes a miss. If the application serves 1000 requests per second and the origin takes 500 ms, then 500 requests will fire off origin computations before the first one finishes and writes back to the cache. Each one consumes resources the system was not provisioned for: database connections, CPU, memory. The origin starts queueing or rejecting requests. The cache entry takes longer to populate because the origin is now saturated. New requests arrive faster than they can be served. Latency spikes; errors begin appearing. The cache eventually populates, the surge subsides, and the system recovers — or it does not, and you have an outage.

Pattern one: probabilistic early expiration

The most elegant solution is probabilistic early expiration, sometimes called XFetch. Each cache reader, with some small probability that grows as the entry approaches expiration, decides to recompute and refresh the entry early. The probability is calibrated so that one reader will almost certainly refresh the entry before it actually expires, but the chance of multiple simultaneous refreshes is low.

def get_with_xfetch(key, ttl, compute_fn, beta=1.0):
    entry = cache.get(key)
    now = time.time()
    if entry is None:
        value = compute_fn()
        cache.set(key, value, ttl=ttl, expiry=now+ttl, delta=measured_compute_time)
        return value
    # Probabilistic early refresh
    if random.random() < math.exp(-beta * entry.delta * math.log(random.random()) / (entry.expiry - now)):
        value = compute_fn()
        cache.set(key, value, ttl=ttl, expiry=now+ttl, delta=measured_compute_time)
        return value
    return entry.value

The probability is near zero when the entry is fresh and approaches one as the entry nears expiration. The delta parameter is the measured time the computation takes, which makes expensive computations refresh proportionally earlier. The beta parameter tunes aggressiveness.

The benefit of XFetch is that no single request waits for the recomputation — the refresh happens in the background while serving the still-valid cached value. The cost is implementing the logic and tracking the delta.

Pattern two: single-flight

The simpler and more common pattern is single-flight, where exactly one request is allowed to compute a missing cache entry while others wait for it. The waiting requests do not all hit the origin; they hit a coordination mechanism that knows about the in-flight computation.

The implementation depends on whether the cache is in-process or external. For an in-process cache, Go's singleflight package and Python's asyncio task coordination both work. For an external cache like Redis, a short-lived lock is the standard approach:

def get_with_singleflight(key, ttl, compute_fn):
    value = redis.get(key)
    if value is not None:
        return value
    # Try to acquire lock; 30s timeout matches expected compute time
    lock_key = f"lock:{key}"
    got_lock = redis.set(lock_key, "1", nx=True, ex=30)
    if got_lock:
        try:
            value = compute_fn()
            redis.set(key, value, ex=ttl)
            return value
        finally:
            redis.delete(lock_key)
    else:
        # Wait briefly and retry the cache
        time.sleep(0.05)
        value = redis.get(key)
        if value is not None:
            return value
        # Fall back to serving stale or computing directly
        return compute_fn()

The waiting requests poll the cache rather than the origin. The lock timeout must exceed the expected computation time, but not so much that a dead computation blocks indefinitely. The fallback path matters: if the lock holder dies, waiting requests must eventually proceed.

Pattern three: stale-while-revalidate

The third pattern is more cultural than algorithmic. Instead of treating expiration as a hard boundary, treat it as a hint. After the TTL expires, continue serving the stale value but trigger a background refresh. New requests see slightly-stale data for a short window, but they never wait on a miss.

This is the model used in HTTP's stale-while-revalidate Cache-Control directive and in CDN edge caches. For application-level caches, the implementation is straightforward: store a soft-expiry and a hard-expiry on each entry. Serve the value if the soft-expiry has not passed. If soft-expiry has passed but hard-expiry has not, serve the stale value and enqueue a background refresh. If hard-expiry has passed, treat as a miss and recompute.

The trade-off is that callers tolerate stale data for the soft-to-hard window. For dashboards, search results, and feature-flag evaluations, this is almost always acceptable. For data that must be fresh — billing balances, security state — it is not.

What does not work

Increasing the TTL does not work. It postpones the stampede rather than preventing it, and longer TTLs mean staler data. Adding jitter to individual cache TTLs does not work for popular entries — the entry expires once, not once per client. Capacity-planning the origin for the stampede load does not work economically because the stampede load is hundreds of times the steady-state load. Pinning popular entries to never expire does not work because the underlying data eventually changes and you need a mechanism to update.

Operational signals

The signal to monitor is the ratio of origin computations to cache misses. In healthy operation, this should be near 1:1 — one computation per miss. During a stampede, it spikes to many-computations-per-miss as concurrent requests all see the miss. Specific signals to alert on:

  • Origin request rate vs cache-miss rate divergence — when these stop matching, single-flight is failing.
  • P99 latency spikes correlated with cache TTL boundaries — visible as periodic latency surges every TTL period.
  • Origin queue depth or connection pool saturation immediately after expected expiration times.
  • Sudden drops in cache hit rate that don't match traffic changes.

The deeper observation

Cache stampede is the canonical example of a system whose failure mode appears only under load. In a development environment, with one developer hitting the application, the stampede never occurs because there is no herd. In a staging environment with synthetic load, it may not occur because the load is well-distributed. It manifests in production, on the most popular cache entries, at the worst possible time. The discipline of building for cache stampede is the discipline of imagining what happens when many things happen at once — a habit that applies far beyond caching.

Read more