Circuit Breakers in Practice: Patterns That Prevent Cascading Failures

The circuit breaker is one of the few patterns in distributed systems where the implementation is genuinely simple and the judgment is genuinely hard. The code is a state machine with three states (closed, open, half-open) and two transitions worth caring about. The hard part is choosing the thresholds, deciding what counts as a failure, and being honest about which dependencies actually deserve a breaker in front of them.

The pattern came from Michael Nygard's Release It! in 2007 and has been productized many times since: Hystrix at Netflix, resilience4j on the JVM, Polly on .NET, opossum on Node, pybreaker on Python. The implementations differ in detail but the state machine is identical. Closed means traffic flows through; open means traffic is rejected without trying; half-open means a small amount of traffic is allowed through to test whether the dependency has recovered.
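
To make the three states concrete, here is a minimal sketch of the state machine as a transition table. The event names (threshold_exceeded, recovery_interval_elapsed, probe_succeeded, probe_failed) are made up for illustration and are not taken from any of the libraries above.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # traffic flows through
    OPEN = "open"            # traffic is rejected without trying
    HALF_OPEN = "half-open"  # a small amount of test traffic is allowed


# Legal transitions. Event names are illustrative, not from a real library.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "recovery_interval_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Everything a real implementation adds (windows, timers, metrics) is bookkeeping that decides when these events fire.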

The cascading failure it prevents

The reason to add a breaker is not "the dependency might fail." Dependencies fail constantly; the right response to most failures is a retry or a graceful error message. The reason to add a breaker is "when the dependency is failing, the failure mode involves my service holding open thousands of connections that will never complete, eating thread pools and connection pools, and cascading the failure to my own callers."

The classic cascading failure looks like this. The downstream API gets slow. Calls start timing out at thirty seconds instead of completing in a hundred milliseconds. Each in-flight call holds a thread, a connection, and an inbound socket. The thread pool fills up. New requests queue. The queue grows. Health checks start timing out. The load balancer marks instances unhealthy. The remaining instances get more traffic and saturate faster. The service is down across the board because of one slow dependency.
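
The arithmetic behind "the thread pool fills up" is worth doing once. The sketch below applies Little's law (average in-flight calls = arrival rate times time per call) to the latencies above. The request rate of 200 per second and the pool of 200 threads are hypothetical numbers chosen for illustration.

```python
def threads_in_use(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average in-flight calls = arrival rate x time per call."""
    return requests_per_second * latency_ms / 1000


RPS = 200        # hypothetical inbound request rate
POOL_SIZE = 200  # hypothetical worker-thread pool size

healthy = threads_in_use(RPS, 100)      # calls complete in 100 ms
degraded = threads_in_use(RPS, 30_000)  # calls hang until a 30 s timeout

print(healthy, degraded, degraded > POOL_SIZE)
# prints: 20.0 6000.0 True
```

Under these assumptions the healthy service uses a tenth of its pool, while the degraded one would need thirty times the pool it has. Nothing about the inbound traffic changed; only the downstream latency did.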

A circuit breaker breaks this chain at the first link. When too many calls are timing out, the breaker opens. New calls fail immediately rather than holding a thread for thirty seconds. The thread pool drains. The service stays alive. The slow dependency might still be slow, but the calling service has decided to fail fast on its own terms rather than wait.

What counts as a failure

The first decision is what increments the breaker's failure counter. The naive answer is "anything that is not a 2xx response," and it is wrong in important ways. A 404 on a resource lookup is a normal outcome, not a system failure. A 401 on an expired token is a client problem, not a downstream failure. A 429 rate limit response is information about your own behavior, not the dependency's health.

The right answer is a small whitelist of failure modes: timeouts, connection errors, and a specific subset of 5xx responses (typically 502, 503, 504; sometimes 500). Everything else is a successful call that happened to return an error code. If you count all errors against the breaker, the breaker will open during normal operation, refuse legitimate traffic, and convince its operators that breakers are useless.
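
A minimal sketch of that classification follows. The function name and the status set are illustrative. It assumes errors surface as Python's built-in TimeoutError and ConnectionError; a real HTTP client raises its own exception types, which would need to be mapped onto these.

```python
from typing import Optional

# 5xx codes that usually point at an unhealthy dependency. Whether 500
# belongs in this set is a per-dependency judgment call.
BREAKER_STATUS_CODES = frozenset({502, 503, 504})


def counts_as_failure(status: Optional[int] = None,
                      error: Optional[BaseException] = None) -> bool:
    """Decide whether one call outcome should count against the breaker."""
    if error is not None:
        # Only timeouts and connection-level errors count; any other
        # exception is treated as a bug or a client problem.
        return isinstance(error, (TimeoutError, ConnectionError))
    # 404, 401, 429 and every other status fall through to False.
    return status in BREAKER_STATUS_CODES
```

With this in place, counts_as_failure(status=404) is False while counts_as_failure(status=503) and counts_as_failure(error=TimeoutError()) are True.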

Choosing thresholds

The two thresholds are the open threshold (how many failures before the breaker opens) and the recovery interval (how long the breaker stays open before testing again). The naive answer of "5 failures, 60 seconds" is a reasonable starting point but is rarely the right long-term answer.

The open threshold should be based on rate, not absolute count. Five failures in a minute on a service that handles ten requests per minute is a 50% error rate; five failures in a minute on a service that handles ten thousand requests per second is statistical noise. The right metric is "failure rate over a sliding window of N seconds, with a minimum sample size to avoid opening on small samples." A typical setting is "open when failure rate exceeds 50% over the last 30 seconds, with at least 20 samples."
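
One way to express that rule is a time-based sliding window. This is a sketch, not how any particular library does it. The class name and the injectable clock parameter are assumptions made so the example can be tested without sleeping.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class FailureRateWindow:
    """Failure rate over the last `window_seconds`, with a minimum sample size."""

    def __init__(self, window_seconds: float = 30.0, threshold: float = 0.5,
                 min_samples: int = 20,
                 clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self.clock = clock
        self.samples: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def record(self, failed: bool) -> None:
        self.samples.append((self.clock(), failed))

    def _evict_expired(self) -> None:
        cutoff = self.clock() - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def should_open(self) -> bool:
        self._evict_expired()
        total = len(self.samples)
        if total < self.min_samples:
            return False  # too few samples to trust the rate
        failures = sum(1 for _, failed in self.samples if failed)
        return failures / total > self.threshold
```

The minimum sample size is what keeps a quiet service from opening its breaker on two bad calls at three in the morning.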

The recovery interval should be tuned to the typical recovery time of the dependency. A database that fails over in 30 seconds wants a 60-second open interval; an external API that takes minutes to recover wants a longer one. Setting it too short means the breaker thrashes between open and closed; setting it too long means recovery takes longer than necessary.

Half-open and the test request

The half-open state is what makes the breaker more than a fancy timeout. After the recovery interval expires, the breaker does not close immediately. It allows one or a small number of requests through to test whether the dependency has recovered. If those test requests succeed, the breaker closes and full traffic resumes. If they fail, the breaker re-opens and waits another interval.

The half-open behavior matters for two reasons. First, it prevents a thundering herd when the recovery interval expires; only the test requests are allowed through, so a dependency that is still recovering is not hit with full production traffic all at once. Second, it provides a clean signal of recovery; the test request either succeeds or fails, with no ambiguity from concurrent traffic.
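
The recovery interval and the half-open probe fit together as in the sketch below. It makes several simplifying assumptions, all labeled in the code: it opens on a count of consecutive failures rather than a rate, it treats every exception as a failure, and it is not thread-safe. The names Breaker and CircuitOpenError are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""


class Breaker:
    # Simplifications: consecutive-failure count instead of a rate, every
    # exception counts as a failure, and no locking (not thread-safe).
    def __init__(self, failure_threshold: int = 5,
                 recovery_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_interval = recovery_interval
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_interval:
                raise CircuitOpenError("open: failing fast")
            self.state = "half-open"  # interval elapsed: allow one probe
        if self.state == "half-open":
            if self.probe_in_flight:
                raise CircuitOpenError("half-open: probe already in flight")
            self.probe_in_flight = True
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self.state = "closed"
        self.failures = 0
        self.probe_in_flight = False

    def _on_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"  # a failed probe restarts the interval
            self.opened_at = self.clock()
        self.probe_in_flight = False
```

The probe_in_flight flag only matters once calls overlap; a version meant for concurrent use would guard the state checks with a lock.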

Where breakers don't belong

A common, well-meaning mistake is to put a breaker in front of every external call, including ones where the breaker provides no benefit. The breaker has overhead: state to maintain, metrics to emit, logic to evaluate on every call. If the dependency is fast, never times out, and is not on the path to cascading failure, the breaker is dead weight.

The places a breaker earns its keep are calls that are slow, calls that hold scarce resources (threads, connections, file descriptors), and calls whose failure can take down the calling service. Typical examples are calls to a primary database, an authentication service, a payment processor, or an external API. The places it doesn't earn its keep are local in-memory caches, fast in-process libraries, and calls to dependencies that already fail fast.

Observability

The breaker only earns its place if its state is visible. The minimum metrics are state changes (closed-to-open and open-to-closed events), call counts in each state, failure rate inside the closed-state window, and time spent in each state. A breaker that opens silently and stays open for hours is almost as bad as the failure it was meant to prevent; the operations team needs to see when a breaker has opened and have an alert when one stays open longer than expected.
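
A sketch of the bookkeeping behind those metrics is below. The BreakerMetrics class, its hook names, and the 300-second alert threshold are assumptions for illustration; a real setup would export these values to whatever metrics system is already in use. The failure rate inside the closed-state window is left out to keep the example short.

```python
import time
from collections import Counter, defaultdict
from typing import Callable, DefaultDict


class BreakerMetrics:
    """Tracks transitions, calls per state, and time in each state."""

    def __init__(self, max_open_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.max_open_seconds = max_open_seconds  # hypothetical alert threshold
        self.state = "closed"
        self.entered_at = clock()
        self.transitions: Counter = Counter()  # ("closed", "open") -> count
        self.calls: Counter = Counter()        # state -> calls seen in it
        self.seconds_in_state: DefaultDict[str, float] = defaultdict(float)

    def on_call(self) -> None:
        self.calls[self.state] += 1

    def on_state_change(self, new_state: str) -> None:
        now = self.clock()
        self.seconds_in_state[self.state] += now - self.entered_at
        self.transitions[(self.state, new_state)] += 1
        self.state, self.entered_at = new_state, now

    def open_too_long(self) -> bool:
        """True when the breaker has stayed open longer than expected."""
        return (self.state == "open"
                and self.clock() - self.entered_at > self.max_open_seconds)
```

An alert wired to open_too_long is what separates a breaker that is doing its job from one that has been silently rejecting traffic since Tuesday.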

Across the four APIs we run (DocuMint, CronPing, FlagBit, and WebhookVault), we have breakers in front of three things: Stripe API calls (slow under load, holds connections), Listmonk (occasionally restarts), and webhook delivery to customer endpoints in WebhookVault (the whole product depends on this not cascading). Everything else is a plain HTTP call with a timeout. The discipline is that adding a breaker is an explicit decision, justified by a specific failure mode, not a default.

The pattern is simple, the judgment is hard, and the reward is the ability to keep your service alive when something it depends on is dying. That is enough to justify the implementation effort, on the dependencies where it matters.