The standard advice for any network call in a distributed system is to set a timeout. The advice is correct as far as it goes; an unbounded wait is a worse failure mode than a fast error. But the way most teams implement timeouts in practice is wrong in a specific and consequential way: they pick a number per call, hardcode it near the call site, and treat each timeout as independent of the others. Under load, this leads to a particular failure pattern where the user request has already given up but the backend is still doing work for it, eventually finishing successfully and producing nothing useful. Multiply this by every layer in the system and a meaningful fraction of your CPU is being spent computing answers nobody is waiting for.
The cure is to think in deadlines instead.
The compounding-timeout problem
Imagine a request flow: the browser calls API gateway, which calls service A, which calls services B and C in parallel, each of which calls a database. Each of those steps has a timeout. The browser has a 30-second wait. The gateway timeout to A is 10 seconds. A's timeout to B and C is 5 seconds each. B and C timeouts to the database are 2 seconds each.
This looks fine. It is not. If A spends 5.5 seconds on B and C in parallel and then 4.6 seconds on its own work, the gateway's 10-second timeout is exceeded only by 100ms, so the gateway gives up. But A keeps running. So does B and C and the database calls. The user has the error already. The work continues for nothing.
Worse: under load, B and C might both be retrying calls to the database. The retries make B and C exceed their 5-second budgets, A exceeds its 10-second budget, the gateway returns an error. But the database is still under load from the retries even after the user is gone. The hot path is computing answers for ghost requests.
The deadline pattern
Replace per-call timeouts with a single deadline per request, propagated as part of the request context. The deadline is the absolute time after which work for this request is no longer useful. Every component receives the deadline and uses it to compute its own remaining budget.
Concretely, the gateway sets deadline = now() + 30s when it accepts the user request. It passes that deadline to A as a header (X-Deadline: 1735689630.123) or via gRPC metadata. A receives the deadline, computes its remaining budget as deadline - now(), and uses that as its own working budget. When A calls B, it passes the same deadline. When B sets its database timeout, it sets it to min(some_max, deadline - now() - small_buffer).
The properties this gives you: the original wait time is honored end-to-end. No layer ever waits longer than the user is waiting. When the user disconnects or times out, every downstream layer sees a deadline in the past and stops immediately. Retries respect the budget; if there are 200ms left and a single attempt takes 500ms, you do not retry.
What the deadline replaces
Deadlines do not replace per-call timeouts as a defense; they replace per-call timeouts as the primary mechanism. You still want a maximum per-call timeout as a safety rail (so a misbehaving downstream cannot bind your worker indefinitely), but the operational timeout is computed from the deadline. The ceiling is the safety net; the deadline is the policy.
The pattern also replaces the impulse to "tune timeouts." Once you have deadlines, the question is no longer "what should B's timeout to the database be?" The question is "how do we apportion the request budget among components?" That is a different and easier question, because it admits a real answer: B should reserve enough time to do its own work plus the maximum useful wait for the database.
Wiring it through
The hardest part of adopting deadlines is plumbing them through the call tree. Most HTTP clients and frameworks do not have first-class support. The mechanics:
Generate at the entry point. The gateway, edge proxy, or load balancer sets the deadline based on a request budget configured per route. Typical values: 30s for human-facing requests, 60s for batch operations, 5s for synchronous internal calls.
Propagate as a header or metadata. An absolute Unix timestamp (X-Deadline: 1735689630.123) is the easiest to reason about across clock-synchronized services. If clocks may drift, propagate a relative remaining budget (X-Deadline-Ms: 28500) and recompute at each hop, accepting that you lose precision for safety.
Read at every hop. Middleware decorates the request context with request.deadline. Every outbound call site queries this and computes its timeout as min(max_per_call, request.deadline - now() - safety_buffer).
Skip retries when the budget is gone. Retry logic checks the deadline before each attempt. If less than the worst-case attempt time remains, do not retry. This is the single highest-leverage place the pattern pays off.
Cancel work in flight. When the deadline expires, signal cancellation. In Go, this is a context cancellation; in Python with asyncio, it is task cancellation; in Java, it is a thread interrupt or future cancellation. The signal must propagate into the call stack so the work can stop, not just be ignored on return.
Retrofitting
You probably do not have deadlines today. Adding them to a running system is gradual, not big-bang. Start at the edge: pick the highest-traffic endpoint, set a deadline at the entry, propagate it through one or two layers, leave per-call timeouts in place as the safety net. Watch the metric "wasted CPU after client disconnect" drop; that is the signal that the pattern is working.
Then push the deadline deeper, layer by layer. Each layer that learns to respect the deadline removes a class of orphaned work. The transition can take months in a large system. The improvement is monotonic: every layer added improves things, no layer added makes things worse. Unlike most distributed-systems refactors, deadlines are an unambiguous gain.
What this teaches
The deeper lesson behind deadlines is that distributed-system call trees do not want per-call configuration; they want budgets that flow with the request. The same principle applies to retries (a retry budget, not per-call retries), to memory (a request memory budget, not per-allocation limits), to logging (a request log budget, not per-line throttles).
The pattern is uncomfortable at first because it requires components to coordinate through context rather than through static configuration. Once it is in place, it makes the system simpler, not more complex: there is exactly one knob per request type, and it has a meaning that survives transit through the architecture.
We use deadlines internally on the four developer APIs we run at DocuMint (PDF generation), CronPing (cron monitoring), FlagBit (feature flags), and WebhookVault (webhook capture and replay). Even at small scale the pattern earns its keep on the webhook replay path, where a slow target endpoint used to bind workers for the full 30-second timeout regardless of whether the originating request was still around. With deadlines, the worker frees as soon as the originating webhook delivery times out upstream.