Connection Timeouts: Read, Write, Connect, Idle, and the Misconfigurations That Compound
Most timeout problems aren't about a missing timeout. They're about the wrong timeout configured at the wrong layer, or a default value inherited from a library that doesn't match the network reality. The four common timeout types behave differently and compound in ways that produce mysterio...
Every network call has at least four implicit timeouts: how long to wait for a connection to be established, how long to wait for response data (both the first byte and each subsequent chunk of a streaming response), how long to spend sending the request, and how long an idle connection can sit before it's closed. Most code treats these as a single number, or worse, doesn't set them at all and inherits whatever the underlying library decided. Both approaches produce systems that work fine in development and fail mysteriously in production, where the network is slower, more variable, and full of intermediaries that close connections without warning.
We've debugged timeout-related incidents across DocuMint, CronPing, FlagBit, and WebhookVault, and the pattern that recurs is that the timeout itself was never the bug. The bug was the misalignment between timeouts at different layers, or the default value that nobody had explicitly chosen.
The four timeouts
Connect timeout bounds the TCP handshake and TLS negotiation. On a healthy network this takes tens of milliseconds for nearby hosts and 100-300ms for transcontinental ones. A reasonable value is 5-10 seconds: enough to absorb transient network hiccups but short enough to fail fast when a host is genuinely unreachable. The wrong value is no timeout at all, which is the default in many HTTP clients and which leaves calls to an unreachable host hanging until the OS abandons the handshake (roughly two minutes on Linux with the default tcp_syn_retries setting, about 75 seconds on BSD-derived stacks, and different again on other platforms).
Read timeout bounds the wait for response data. There are two interesting variants: time-to-first-byte (how long after sending the request before any response data arrives) and inter-byte timeout (how long to wait between successive chunks of response data). Most libraries conflate these into a single read timeout that applies to both, but the two failure modes are different. A slow first byte usually means the server is overloaded; a stall mid-stream usually means a network problem somewhere in between. That single value is also usually applied to each individual socket read, so it does not bound the total time a slow, trickling response can take.
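The two read variants can be told apart with a plain socket by changing the timeout once the first chunk has arrived. The following is a minimal sketch, not a production HTTP reader: `read_response`, `FirstByteTimeout`, and `InterByteTimeout` are hypothetical names introduced for illustration, and it assumes the peer signals the end of the response by closing the connection.

```python
import socket


class FirstByteTimeout(TimeoutError):
    """No response data arrived within the time-to-first-byte budget."""


class InterByteTimeout(TimeoutError):
    """The response started, then stalled longer than the inter-byte budget."""


def read_response(sock, first_byte_timeout, inter_byte_timeout, bufsize=65536):
    """Read until the peer closes, with separate first-byte and inter-byte limits."""
    chunks = []
    # Until any data arrives, the first-byte budget applies.
    sock.settimeout(first_byte_timeout)
    while True:
        try:
            data = sock.recv(bufsize)
        except socket.timeout:
            if chunks:
                raise InterByteTimeout("response stalled mid-stream") from None
            raise FirstByteTimeout("no response data received") from None
        if not data:
            # Empty read: the peer closed the connection, so the response is complete.
            return b"".join(chunks)
        chunks.append(data)
        # After the first chunk, each further read gets the inter-byte budget.
        sock.settimeout(inter_byte_timeout)
```

Raising distinct exception types lets the caller log an overloaded upstream separately from a stalled network path, which is the distinction the single-value read timeout hides.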
Write timeout bounds the time spent sending request data to the server. For small requests this is rarely interesting because TCP buffers absorb the whole request quickly. For large uploads — say, sending a 50MB PDF for processing — the write timeout matters and is often missing. A stalled write doesn't return an error from the OS for a long time, so the application just hangs.
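For large uploads, one way to bound the write is to send in chunks, with a stall limit on each chunk and an overall budget for the whole payload. The sketch below works on a raw socket; `send_all`, `CHUNK`, and both parameter names are assumptions made for illustration, and a real HTTP client would expose this differently, if at all.

```python
import socket
import time

CHUNK = 64 * 1024  # bytes handed to the kernel per send() call (illustrative value)


def send_all(sock, payload, stall_timeout, total_timeout):
    """Send payload, failing if one chunk stalls or the whole upload runs too long."""
    deadline = time.monotonic() + total_timeout
    view = memoryview(payload)
    sent = 0
    while sent < len(view):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"upload exceeded total budget after {sent} bytes")
        # Never wait longer on one chunk than the time left in the overall budget.
        sock.settimeout(min(stall_timeout, remaining))
        try:
            sent += sock.send(view[sent:sent + CHUNK])
        except socket.timeout:
            raise TimeoutError(f"write timed out after {sent} bytes") from None
    return sent
```

The per-chunk limit catches a peer that has stopped reading, while the total budget catches a peer that reads just fast enough to avoid ever stalling.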
Idle timeout applies to persistent connections sitting in a connection pool between requests. The value matters because intermediaries (load balancers, NAT gateways, firewalls) typically close idle connections after some interval. If your client's idle timeout is longer than the intermediary's, you'll send requests on connections the network thinks are dead, and you'll get RST or silent drops. The fix is to make the client's idle timeout shorter than the shortest intermediary timeout in the path. Five minutes is a reasonable default when nothing in the path is stricter; some cloud load balancers default to 60 seconds, in which case 50 seconds on the client is a safe choice.
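The rule of keeping the client's idle limit below the shortest intermediary limit can be enforced when a connection is checked out of the pool. This is a minimal sketch under assumptions: `IdlePool` is a hypothetical class, `factory` is any callable that returns an object with a `close()` method, and the pool is not thread-safe.

```python
import time
from collections import deque


class IdlePool:
    """Connection pool that discards connections idle longer than max_idle_seconds."""

    def __init__(self, factory, max_idle_seconds=50.0, clock=time.monotonic):
        self._factory = factory
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle = deque()  # entries are (connection, time it was released)

    def acquire(self):
        while self._idle:
            # Most recently released first, so the freshest connection is reused.
            conn, released_at = self._idle.pop()
            if self._clock() - released_at < self._max_idle:
                return conn
            # Too old: an intermediary may already have dropped it, so close it.
            conn.close()
        return self._factory()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

With a 60-second load balancer in the path, `IdlePool(factory, max_idle_seconds=50.0)` never hands out a connection the load balancer is likely to have closed.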
The defaults that compound
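Most HTTP libraries have inconvenient defaults. Python's requests library defaults to no timeout at all. Node.js's built-in http client applies no timeout to outgoing requests unless you set one; the two-minute default that is often quoted belonged to server-side sockets in older releases. Go's http.DefaultClient has no overall request timeout, and although its default transport bounds the dial (30 seconds) and the TLS handshake (10 seconds), nothing bounds the wait for response headers or body. Java's HttpURLConnection defaults to infinite connect and read timeouts. The cumulative effect is that an application written without explicit timeouts can hang for minutes, or indefinitely, on many kinds of network failure, and the people writing the application won't know until production.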
The mitigation is straightforward but easy to skip: every HTTP client in the codebase should have explicit timeouts set. We use a small helper across the four products that wraps the standard library client with sensible defaults — 5s connect, 30s read, no idle pool reuse for sensitive paths, 50s idle for general paths. Code that needs different values overrides them explicitly. Code that doesn't is at least failing fast instead of hanging.
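A helper of this kind might look roughly like the following in Python. This is a sketch, not the helper used in the products named above: `Timeouts`, `DEFAULT`, `SENSITIVE`, and `open_http` are hypothetical names, and only the numeric defaults come from the text. It relies on `http.client` applying its `timeout` argument to the connect phase and on the connected socket being reachable as `conn.sock`.

```python
import http.client
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Timeouts:
    connect: float = 5.0           # seconds allowed for TCP connect (and TLS, if used)
    read: float = 30.0             # seconds allowed for each socket read
    idle: Optional[float] = 50.0   # pool idle limit; None means never reuse


DEFAULT = Timeouts()
SENSITIVE = replace(DEFAULT, idle=None)  # sensitive paths: no idle pool reuse


def open_http(host, port, timeouts=DEFAULT, tls=False):
    """Open a connection using the connect timeout, then switch to the read timeout."""
    cls = http.client.HTTPSConnection if tls else http.client.HTTPConnection
    conn = cls(host, port, timeout=timeouts.connect)
    conn.connect()                        # bounded by timeouts.connect
    conn.sock.settimeout(timeouts.read)   # later reads are bounded by timeouts.read
    return conn
```

A call site that needs a longer read limit states it explicitly, for example `open_http(host, 443, replace(DEFAULT, read=120.0), tls=True)`, so every deviation from the defaults is visible in review. The `idle` field is only carried here; enforcing it is the connection pool's job.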
The compounding problem
Timeouts at different layers compound badly when they're misaligned. A canonical case: the application has a 30-second read timeout, the load balancer in front has a 60-second timeout, and the database has a 90-second query timeout. The application gives up first. The load balancer is still waiting for a response. The database is still running the query. The user retries. Now there are two queries running. Repeat enough times and the database is doing 10x the work, with nobody waiting for any of the answers.
The right ordering is the inverse: the inner layer should have the shortest timeout. The database query times out at 30s, the application gives up at 35s, the load balancer at 40s, the user-visible request at 60s. When the timeout fires at the database, everything above it can give up cleanly. We don't always achieve this in practice — the database query timeout is often the hardest to control — but the principle informs how we tune the other layers.
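The ordering rule is simple enough to check mechanically, for instance in a configuration test. The sketch below assumes the timeouts have been gathered into one list, ordered from the outermost layer to the innermost; `check_timeout_ordering` and the `min_margin` parameter are hypothetical.

```python
def check_timeout_ordering(layers, min_margin=1.0):
    """Return a list of problems; empty means every outer layer outlasts the one inside it.

    layers: (name, timeout_seconds) pairs ordered from outermost to innermost.
    """
    problems = []
    for (outer_name, outer_s), (inner_name, inner_s) in zip(layers, layers[1:]):
        # The outer layer needs some headroom to receive the inner layer's error.
        if outer_s - inner_s < min_margin:
            problems.append(
                f"{outer_name} ({outer_s}s) must exceed {inner_name} ({inner_s}s) "
                f"by at least {min_margin}s"
            )
    return problems


aligned = [("user request", 60), ("load balancer", 40), ("application", 35), ("database", 30)]
misaligned = [("load balancer", 60), ("application", 30), ("database", 90)]

print(check_timeout_ordering(aligned))
print(check_timeout_ordering(misaligned))
```

The first list uses the values from the ordering above and produces no problems. The second uses the canonical misaligned case and flags the application giving up before the database does.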
The retry trap
Timeouts and retries interact in ways that produce most of the worst incidents. A request times out at 5s and retries. The retry succeeds at 8s — but the original request is still running on the server, because the server didn't see the timeout. Now we have duplicate work. If the request had side effects, we have duplicate side effects.
The fix is idempotency keys (which we covered in the previous cycle) plus careful retry policies: don't retry on timeout unless the request is idempotent, don't retry forever, use exponential backoff with jitter to avoid thundering herds, and propagate a deadline downstream so retries don't extend the user's wait. The combination is what makes systems behave correctly under partial network failure.
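The retry rules above can be sketched as one wrapper. This is an illustration under assumptions, not a drop-in library: `call_with_retries` and all its parameters are hypothetical, the wrapped `op` is expected to accept the remaining time budget as its timeout and to raise `TimeoutError` when it expires, and the idempotency key itself is assumed to be handled inside `op`.

```python
import random
import time


def call_with_retries(op, *, idempotent, deadline_s, max_attempts=4,
                      base_delay=0.2, max_delay=5.0,
                      clock=time.monotonic, sleep=time.sleep, rng=random.random):
    """Call op(remaining_seconds), retrying timeouts only when it is safe to do so."""
    deadline = clock() + deadline_s
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("deadline exhausted before attempt")
        try:
            # Propagate the deadline: each attempt only gets the time that is left.
            return op(remaining)
        except TimeoutError:
            # Never retry a timed-out request that may have had side effects,
            # and never retry forever.
            if not idempotent or attempt == max_attempts:
                raise
        # Exponential backoff with full jitter: a random delay up to the capped step.
        delay = rng() * min(max_delay, base_delay * 2 ** (attempt - 1))
        if clock() + delay >= deadline:
            raise TimeoutError("backoff would exceed the deadline")
        sleep(delay)
```

Because every attempt and every backoff sleep is charged against the same deadline, retries can shorten the time available to later attempts but never extend the caller's total wait.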
The five operational signals
A timeout that fires correctly is still an event the system should observe. We track: rate of timeouts per call site (a sudden change indicates a network or upstream problem), p99 of successful call duration (it should sit well below the timeout; a p99 creeping toward the limit means the timeout is about to start cutting off normal calls), retry rate after timeout (a sudden increase usually precedes a thundering-herd incident), connection-pool wait time (long waits indicate timeouts upstream are leaving connections in unusable states), and idle connection failures (RSTs on first use after an idle period mean the idle timeout is misaligned with an intermediary).
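The first two signals need little more than a counter and a list of durations per call site. The following in-process sketch shows the shape of the bookkeeping; `CallStats` and its methods are hypothetical, and a real deployment would more likely export these values to a metrics system than compute percentiles in memory.

```python
from collections import defaultdict


class CallStats:
    """Per-call-site timeout rate and p99 of successful call durations."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.timeouts = defaultdict(int)
        self.ok_durations = defaultdict(list)

    def record(self, site, duration_s, timed_out):
        self.calls[site] += 1
        if timed_out:
            self.timeouts[site] += 1
        else:
            # Only successful calls feed the latency percentile.
            self.ok_durations[site].append(duration_s)

    def timeout_rate(self, site):
        total = self.calls[site]
        return self.timeouts[site] / total if total else 0.0

    def p99(self, site):
        durations = sorted(self.ok_durations[site])
        if not durations:
            return None
        # Nearest-rank percentile: ceil(0.99 * n), computed in integer arithmetic.
        rank = (99 * len(durations) + 99) // 100
        return durations[rank - 1]
```

Comparing `p99(site)` against the configured timeout for that site shows how much headroom the limit leaves, and `timeout_rate(site)` is the value to alert on when it changes suddenly.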
The deeper observation
Timeout configuration looks like a tuning problem and is actually a system-design problem. The right values depend on what the call is doing, what the network path looks like, what intermediaries are present, and what the calling code wants to do on failure. Inheriting library defaults is a way of pretending the system doesn't have timeouts when in fact every system has timeouts — the question is whether they're chosen or unchosen. Choosing them, even imperfectly, is one of the highest-leverage things small teams can do for production reliability. Most outages we've seen attributable to timeouts are outages caused by missing or misaligned timeouts, not outages caused by the values being wrong by twenty percent.