HTTP Keep-Alive and Connection Reuse: The Hidden Latency Lever
Every HTTP request that opens a new TCP connection pays for the three-way handshake plus the TLS handshake, which on a typical internet path is 100-200 milliseconds. Keep-alive amortizes that cost across many requests on the same connection, but only if every layer of your stack actually r
The cost of an HTTP request is dominated, on a typical internet path, by the network setup rather than the actual data transfer. A TCP three-way handshake takes one round trip. A TLS 1.3 handshake takes one more (or zero with TLS 1.3 0-RTT, but that has its own constraints). On a 50ms RTT path that is 100ms before any application bytes flow. For a request that returns a few kilobytes of JSON, the setup is more expensive than everything else combined. HTTP keep-alive — the mechanism that lets multiple requests share a single TCP connection — exists because every request paying that setup cost from scratch would be a performance disaster.
Most engineers know keep-alive exists. Far fewer have audited their stack to confirm that it actually works end to end. This post covers the mechanism, the layers where reuse can break, and the operational signals that tell you whether your stack is reusing connections or quietly opening fresh ones for every request. The patterns apply across all four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — and especially to WebhookVault's outbound forwarding, where the difference between reusing connections and opening new ones is the difference between sub-second delivery and multi-second delivery on busy endpoints.
What keep-alive actually does
HTTP/1.1 made persistent connections the default. The Connection: keep-alive header is technically redundant in HTTP/1.1 but is still commonly sent. The mechanism is straightforward: instead of closing the TCP connection after the response, the server keeps it open for a configurable idle timeout. If the client sends another request on the same connection before the timeout expires, that request skips both the TCP and TLS handshakes. The connection is closed when either side decides it has been idle long enough, or when an explicit Connection: close header signals the end of reuse.
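The mechanism can be observed with nothing but the Python standard library. The sketch below is illustrative only: CountingHandler and count_connections are hypothetical names, and the server is a throwaway local one that counts how many TCP connections it accepts for a given number of requests.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes the stdlib server keep connections open between requests.
    protocol_version = "HTTP/1.1"
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(requests_total, reuse):
    """Return how many TCP connections the server accepted for N requests."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        for _ in range(requests_total):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                conn.close()  # http.client reconnects on the next request
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("reused:", count_connections(5, reuse=True))
    print("fresh: ", count_connections(5, reuse=False))
```

On a real internet path each extra connection in the second case would also carry a TCP and TLS handshake, which is where the latency goes.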
HTTP/2 takes this further. A single TCP connection multiplexes many concurrent requests, with framing that interleaves request and response data on the same connection. The setup cost is paid once and then amortized across the lifetime of the connection. HTTP/3 (QUIC) collapses the TCP and TLS handshakes into a single round trip and supports 0-RTT for repeat connections, but the underlying principle of amortizing setup cost across many requests is the same.
The savings are substantial. Assume a path with 100ms of total handshake cost and 10ms of actual request time. Ten sequential requests on fresh connections take 110ms each, 1,100ms in total. Ten requests on a reused connection take 110ms for the first and 10ms for each of the rest, 200ms in total. That is a saving of 900ms, roughly 82% of the wall-clock time. For high-fanout systems that make many small requests, this is the difference between a performant integration and a slow one.
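The arithmetic is easy to check. This is a minimal sketch that assumes sequential requests, a fixed 100ms handshake cost, and 10ms of request time per call; total_time_ms is a hypothetical helper, not part of any library.

```python
def total_time_ms(requests_total, handshake_ms, transfer_ms, reuse):
    """Wall-clock time for sequential requests, with or without connection reuse."""
    # A reused connection pays the handshake once; fresh connections pay it every time.
    handshakes = 1 if reuse else requests_total
    return handshakes * handshake_ms + requests_total * transfer_ms


fresh_ms = total_time_ms(10, handshake_ms=100, transfer_ms=10, reuse=False)
reused_ms = total_time_ms(10, handshake_ms=100, transfer_ms=10, reuse=True)
saving_ms = fresh_ms - reused_ms
saving_fraction = saving_ms / fresh_ms

if __name__ == "__main__":
    print(fresh_ms, reused_ms, saving_ms, round(saving_fraction, 2))
```

The saving grows with the number of requests per connection, because the single handshake is spread over more calls.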
The layers where reuse can break
The first place reuse breaks is in client libraries that close the connection after every request. The Python requests library, for example, opens a fresh connection for every requests.get() call unless you use a Session object. The requests.Session object has a connection pool and reuses connections across calls, but a code base that imports requests and calls requests.get() directly never reuses anything. Fixing this is usually a matter of replacing requests.get with a long-lived session.get, but the change has to be made everywhere, and one direct requests.get call in a hot path can dominate the latency budget.
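One way to make the fix stick is to route every call through a single long-lived session. The following is a sketch, not a prescription: build_session, SESSION, and fetch_status are hypothetical names, the pool sizes are placeholder values, and sharing one session across threads is a common pattern that the requests documentation does not formally guarantee to be safe.

```python
import requests
from requests.adapters import HTTPAdapter


def build_session(pool_maxsize=10):
    """Create a session whose adapter keeps up to pool_maxsize connections per host."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=pool_maxsize)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


# Created once at import time and reused by every caller in the process.
SESSION = build_session()


def fetch_status(url):
    """Issue a GET through the shared session so the pooled connection is reused."""
    response = SESSION.get(url, timeout=5)
    return response.status_code
```

A lint rule or a grep for direct requests.get, requests.post, and similar calls helps keep stray call sites from bypassing the session.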
The second place is in HTTP/2 client libraries that do not actually use HTTP/2. Many libraries that advertise HTTP/2 support do so only when the protocol is explicitly enabled, and the default is HTTP/1.1. The difference is observable in production: HTTP/2 traffic shows one TCP connection with many concurrent streams; HTTP/1.1 traffic shows multiple TCP connections, often one per concurrent request. A library that uses HTTP/1.1 is still reusing connections within each pool slot, but a high-concurrency workload exhausts the pool and starts opening new connections, paying the handshake cost on each one.
The third place is in load balancers and proxies. A load balancer that terminates TLS reuses the inbound TCP connection but may open a fresh connection to the backend for every request. The Caddy reverse proxy that fronts our four products defaults to keep-alive on both sides, but a misconfigured proxy that sets Connection: close on the backend side gives up the win. The same applies to AWS ALB, NGINX, and HAProxy: each has settings that control backend connection reuse, and the defaults are not universally aggressive.
The fourth place is in firewalls and NAT gateways. A connection that has been idle for longer than the firewall's connection-tracking timeout gets silently dropped. The next request on that connection times out instead of failing fast, because the client side does not know the connection was killed. The mitigation is to set the keep-alive idle timeout below the firewall's tracking timeout, or to send TCP keep-alive probes that reset the timer. Most firewalls default to 5-30 minute timeouts; HTTP keep-alive timeouts are usually shorter than this, but a long-running pool that holds connections idle can hit the limit.
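TCP keep-alive probes can be enabled per socket. The sketch below assumes a firewall tracking timeout of around five minutes, so it starts probing after 60 seconds of idle time; those numbers and the enable_tcp_keepalive helper are illustrative assumptions. The tuning options beyond SO_KEEPALIVE are platform-specific, so the code applies them only where Python exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keep-alive probes so idle connections keep refreshing firewall state."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning knobs exist on Linux and some other platforms; skip any that are missing.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Most HTTP client libraries create their own sockets, so in practice this kind of option is usually passed through whatever socket-options hook the library provides rather than applied by hand.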
Connection pool sizing
A client connection pool has a maximum size, and that size matters. Too small and high-concurrency workloads queue waiting for a connection, even though more connections could be opened. Too large and the pool itself becomes a resource leak — too many open file descriptors, too many TCP slots in the kernel, too many TLS sessions in memory. The right size depends on the workload and the latency profile of the destination.
The starting point for an internal service-to-service pool is roughly twice the steady-state concurrency. For a service that sends 100 RPS to a downstream with 50ms latency, the steady-state concurrency is 5 (Little's Law: 100 * 0.05 = 5), and a pool of 10 connections handles bursts to 200 RPS without blocking, provided latency holds at 50ms. For external integrations where latency is more variable, the pool needs to be larger to absorb the variance: 4-5x steady-state is a reasonable default.
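The sizing rule reduces to a few lines. This sketch uses hypothetical helper names, and the headroom multiplier is the rule of thumb from the paragraph above rather than a measured value.

```python
import math


def steady_state_concurrency(rps, latency_s):
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return rps * latency_s


def pool_size(rps, latency_s, headroom=2.0):
    """Starting pool size: steady-state concurrency times a headroom multiplier."""
    # round() guards against floating-point noise pushing ceil() up by one.
    return math.ceil(round(steady_state_concurrency(rps, latency_s) * headroom, 6))


if __name__ == "__main__":
    print(pool_size(100, 0.05))              # internal service, 2x headroom
    print(pool_size(100, 0.05, headroom=4))  # external integration, 4x headroom
```

Treat the result as a starting point and adjust it against the observed pool wait time.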
The more important setting is often the per-host limit rather than the total pool size. Python's aiohttp, Node's https.Agent, and Go's http.Transport all have separate limits for total connections and for connections per host. A pool that allows 100 total connections but only 6 per host (the historical browser default) gives up most of the reuse win when one host dominates the workload.
The five operational signals
The signals worth monitoring are: TCP connection establishment rate (alert on rates higher than the request rate divided by expected requests-per-connection); TLS handshake count (same logic — should be a small fraction of request count for reused connections); 99th percentile latency for the first request on a connection vs subsequent requests (the gap between them is the handshake cost you are paying); connection pool wait time (alert on non-zero values during peak load); and connection lifetime distribution (a healthy pool shows a long tail of long-lived connections, not a uniform distribution of short ones).
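The first signal can be expressed as a simple check over two counters. The function name and the example threshold of 50 requests per connection are assumptions for illustration; the counters themselves would come from whatever metrics system is already in place.

```python
def connection_reuse_alert(new_connections, requests_total, expected_requests_per_connection):
    """Return True when connections are being opened faster than reuse should allow."""
    if requests_total == 0:
        return False  # nothing to judge in an idle window
    allowed_connections = requests_total / expected_requests_per_connection
    return new_connections > allowed_connections


if __name__ == "__main__":
    # One connection per request: reuse is broken somewhere in the stack.
    print(connection_reuse_alert(1000, 1000, expected_requests_per_connection=50))
    # Twenty connections for a thousand requests: within the expected budget.
    print(connection_reuse_alert(20, 1000, expected_requests_per_connection=50))
```

The same comparison works for the TLS handshake count, since a healthy stack performs roughly one handshake per connection.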
The deeper observation
HTTP keep-alive is one of the highest-leverage performance settings in any system that does network I/O, and one of the easiest to silently lose. A library upgrade that changes default behavior, a load balancer config change that sets Connection: close, a code path that uses requests.get() instead of session.get() — any of these can give back the savings without any obvious symptom in the metrics. The discipline that catches the regression is auditing the connection establishment rate against the request rate periodically and investigating any divergence. A stack that opens one TCP connection per HTTP request is paying handshake cost on every call, and that cost compounds into latency budget that has nowhere else to come from.