Why Your Retry Logic Makes Outages Worse: Thundering Herds and Exponential Backoff

Your service goes down for thirty seconds. It recovers. Then it immediately goes down again. If that's happened to you, there's a good chance your retry logic did it.

The thundering herd

Ten thousand clients are talking to your API. The API restarts. All ten thousand clients get a connection error at roughly the same time. All ten thousand clients wait a fixed interval — say, one second — and retry at roughly the same time. The API starts up and immediately receives ten thousand concurrent requests, which it cannot handle, and crashes again.

This is the thundering herd problem. It happens when requests that fail synchronously are retried synchronously. The timing that caused the original failure is preserved — or made worse — by uniform retry intervals.

Exponential backoff without jitter is also wrong

The standard advice is exponential backoff: retry after 1 second, then 2 seconds, then 4 seconds, doubling each time. This lowers the average retry rate but does not solve the synchrony problem. All ten thousand clients that failed at the same time will still retry at the same time; they've just agreed to do it after 1 second, then 2 seconds, then 4 seconds. The waves arrive less often, and they shrink only as some clients succeed, but each one still hits the server as a synchronized burst.
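
To make the synchrony concrete, here is a minimal sketch that counts when retries arrive under plain exponential backoff. It assumes the worst case, where every client fails at t=0 and every retry fails instantly; the function names are illustrative, not from any library.

```python
from collections import Counter

def backoff_retry_times(attempts=3, base_delay=1.0):
    """Times (seconds after the first failure) at which one client retries
    under plain exponential backoff with no jitter."""
    times = []
    elapsed = 0.0
    for attempt in range(attempts):
        elapsed += base_delay * (2 ** attempt)  # waits of 1s, 2s, 4s, ...
        times.append(elapsed)
    return times

def arrivals_per_instant(num_clients=10_000, attempts=3):
    """Count how many retries land at each instant when every client
    follows the same deterministic schedule."""
    arrivals = Counter()
    for _ in range(num_clients):
        for t in backoff_retry_times(attempts):
            arrivals[t] += 1
    return dict(arrivals)

waves = arrivals_per_instant()
for t, count in sorted(waves.items()):
    print(f"t={t:>4}s  {count} simultaneous retries")
```

Under these assumptions there are only three distinct arrival times, and all ten thousand clients share each of them. Backoff changes when the waves arrive, not how synchronized they are.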

The fix is jitter: add random delay to the backoff interval so clients desynchronize.

import random
import time

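def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Catches everything for brevity; narrow this to transient errors in real code.
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Full jitter: sleep a random time between 0 and the capped backoff.
            cap = min(base_delay * (2 ** attempt), max_delay)
            delay = random.uniform(0, cap)
            time.sleep(delay)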

The "full jitter" approach (random between zero and the capped backoff) spreads load better than "equal jitter" (random between half the cap and the full cap), because it uses the whole interval rather than only its upper half. The AWS Architecture Blog compared the strategies in simulation in 2015 and the analysis has held up. Use full jitter.
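
The difference between the two strategies is easiest to see side by side. This is a rough sketch, not a benchmark: the helper names are made up for illustration, and the seeded generator is there only so the run is repeatable.

```python
import random

def capped_backoff(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff for a zero-based attempt number, capped at max_delay."""
    return min(base_delay * (2 ** attempt), max_delay)

def full_jitter(attempt, rng=random):
    """Uniform draw between 0 and the capped backoff."""
    return rng.uniform(0, capped_backoff(attempt))

def equal_jitter(attempt, rng=random):
    """Half the capped backoff, plus a uniform draw over the other half."""
    half = capped_backoff(attempt) / 2
    return half + rng.uniform(0, half)

rng = random.Random(42)  # seeded only for repeatability
full = [full_jitter(3, rng) for _ in range(10_000)]    # cap is 8 seconds
equal = [equal_jitter(3, rng) for _ in range(10_000)]  # cap is 8 seconds

print(f"full jitter:  min={min(full):.2f}s  max={max(full):.2f}s")
print(f"equal jitter: min={min(equal):.2f}s  max={max(equal):.2f}s")
```

With an 8-second cap, full jitter spreads clients across the whole 0 to 8 second range, while equal jitter packs the same clients into the 4 to 8 second half.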

What to retry and what not to

Retry is only safe for idempotent operations. If your POST creates a record, retrying it creates two records. The operations you can safely retry:

GET requests (reads with no side effects)
PUT requests that are truly idempotent (setting a value to a specific state)
Any request where your server returns a 503 or 429 with a Retry-After header explicitly inviting retry
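
When the server does send Retry-After, the client should wait at least that long rather than use its own schedule. The header carries either a number of seconds or an HTTP date, so a client needs to handle both. A minimal sketch using only the standard library; the function name and the choice to return None on unparseable input are assumptions of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    Accepts delay-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns None if neither form parses.
    """
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        # Treat a date without zone information as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (retry_at - now).total_seconds())
```

A caller might use the larger of this value and its own jittered backoff, so that it honors the server's request without re-synchronizing with every other client that received the same header.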

Operations you should not retry automatically without idempotency keys:

POST requests that create resources
Payment operations
Email sends
Any operation that triggers an irreversible side effect
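
An idempotency key is what makes these operations retryable: the client generates one key per logical operation and sends the same key on every retry, and the server replays the stored result instead of repeating the side effect. The sketch below is a hypothetical in-memory server, not any real payment API; a production version would need an atomic check-and-insert in durable storage and an expiry policy for old keys.

```python
import uuid

class PaymentService:
    """Hypothetical in-memory service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = []          # every charge actually executed
        self._results_by_key = {}  # idempotency key -> stored response

    def charge(self, idempotency_key, amount_cents):
        # A repeated key means the client is retrying: replay the stored
        # response instead of charging again.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        response = {"charge_id": charge_id, "amount_cents": amount_cents}
        self._results_by_key[idempotency_key] = response
        return response

service = PaymentService()
key = str(uuid.uuid4())  # generated once per logical operation, reused on retry
first = service.charge(key, 1999)
retried = service.charge(key, 1999)  # e.g. the first response was lost in transit
print(first == retried, len(service.charges))
```

The retry returns the same response as the first call and only one charge is recorded, which is what makes it safe to put this operation behind automatic retry.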

Budget your retries

There's a subtler failure mode: retry amplification. If 30% of your requests fail and you retry each failure up to three times, your backend sees roughly 140% of nominal load during the failure (1 + 0.3 + 0.09 + 0.027 ≈ 1.42, assuming each attempt fails independently). In a total outage every request is attempted four times, so up to 400% of nominal load is waiting when the service comes back. The remedy is a retry budget. Cap your total retry count. Cap your retry window. If retries would push total traffic above your service's headroom, shed load instead of retrying.
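
One way to enforce such a cap is a client-side budget that permits retries only while they stay below a fixed fraction of the original requests. This is a simplified sketch with hypothetical names and a 10% ratio chosen for illustration; it counts over the whole process lifetime, where a real implementation would more likely use a sliding window or token bucket.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of original requests."""

    def __init__(self, ratio=0.1, min_retries=3):
        self.ratio = ratio              # retries allowed per original request
        self.min_retries = min_retries  # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Call once for every original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and spend budget if a retry is allowed, else False."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 > allowed:
            return False  # over budget: fail fast instead of retrying
        self.retries += 1
        return True

budget = RetryBudget(ratio=0.1, min_retries=0)
for _ in range(100):
    budget.record_request()

# Fifty failures all ask to retry, but only 10% of 100 requests may.
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(f"retries granted: {granted} of 50 requested")
```

With the budget in place, retry traffic is bounded at a fixed fraction of normal traffic no matter how many requests fail, so a total outage cannot multiply load on the recovering service.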

Circuit breakers

A circuit breaker stops sending requests entirely after a threshold of failures, gives the downstream service time to recover, and only resumes traffic after a probe request succeeds. It's the right pattern when the failure is total (the service is down, not degraded) and when the cost of retrying, in load or side effects, exceeds the value of the result.
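
The state machine is small enough to sketch. The version below is a minimal single-threaded illustration, with class names, thresholds, and the injectable clock all chosen for this example; it does not guard against concurrent callers sending several probes at once.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: let this one call through as the probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result

# Usage with a fake clock so the sketch runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=10.0, clock=lambda: now[0])

def failing_call():
    raise ConnectionError("service down")

outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_call)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the recovery timeout has elapsed
outcomes.append(breaker.call(lambda: "ok"))  # probe succeeds and closes the circuit
print(outcomes)
```

Once the breaker opens, callers fail immediately without touching the downstream service, which is the recovery time the pattern is meant to buy.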

The correct mental model: retry is for transient network noise. Circuit breakers are for service failures. Exponential backoff with full jitter is for everything in between. "Retry on failure with no further thought" is the one pattern that reliably makes things worse.
