Why Your Connection Pool Needs a Health Check: Stale Connections Kill Throughput

The failure mode is invisible until it isn't. You deploy on Friday afternoon, traffic drops to near-zero over the weekend, and Monday morning's first requests take 15 minutes to time out instead of completing in 50ms. Your on-call phone starts ringing. The connection pool looks healthy (it shows 20 connections), but none of them actually work.

The Stale Connection Problem

TCP connections in your pool can die silently. Network address translation tables, firewalls, and cloud load balancers all have idle connection timeouts. Common values:

AWS Network Load Balancer: 350 seconds
AWS Application Load Balancer: 60 seconds (idle)
Many corporate firewalls: 60–300 seconds
Linux netfilter (conntrack): 5 days by default (nf_conntrack_tcp_timeout_established = 432000 seconds), but often overridden by cloud providers

When a firewall or NAT device drops a connection from its table, neither side is notified. Your application's TCP socket still exists. The pool still considers the connection valid. The database server may also have no idea the connection is gone: the device that dropped the flow typically discards any further packets the client sends without replying, so no RST or FIN ever arrives.

When a thread borrows that connection and sends a query, the socket enters the TCP retransmit cycle. On Linux with default settings (tcp_retries2=15), the kernel keeps retransmitting for roughly 15 minutes before giving up with ETIMEDOUT, unless the driver enforces a shorter socket timeout of its own. The thread is blocked for all of that time, and the pool is effectively one connection smaller.
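
The 15-minute figure can be reproduced with a little arithmetic. The sketch below assumes Linux's usual bounds of 200 ms for the minimum retransmission timeout and 120 s for the maximum, with the timeout doubling after each attempt. The function name and parameters are illustrative. The kernel documentation describes the result as a hypothetical lower bound, so treat it as an estimate rather than an exact value for any given connection.

```python
def retransmit_timeout_seconds(retries: int = 15,
                               rto_min: float = 0.2,
                               rto_max: float = 120.0) -> float:
    """Sum the waits for the original send plus `retries` retransmissions.

    Assumes the first timeout equals rto_min and doubles each time,
    capped at rto_max (a simplified model of exponential backoff).
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = retransmit_timeout_seconds()
    print(f"{seconds:.1f} s, about {seconds / 60:.1f} minutes")
```

Under these assumptions the default of 15 retries comes to about 924.6 seconds. Lowering tcp_retries2 shortens the hang, but it applies to every TCP connection on the host, which is one reason pool-level checks are usually the first line of defense.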

Why Monday Morning Is Worst

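Stale connections accumulate during quiet periods. Suppose your pool holds 20 connections behind an AWS NLB with a 350-second idle timeout. Any connection that sits unused for longer than 350 seconds is dropped by the load balancer. After a quiet stretch in which only a handful of connections saw occasional traffic, you might have 15 stale connections by the time Monday morning traffic arrives.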
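
To make the accumulation concrete, here is a small sketch that counts how many pooled connections have outlived an idle timeout. The function name and the idle ages are made up for illustration. Only the 350-second NLB figure comes from the list above.

```python
from typing import Sequence


def count_stale(idle_seconds: Sequence[float], idle_timeout: float) -> int:
    """Count connections that have been idle longer than the middlebox timeout."""
    return sum(1 for idle in idle_seconds if idle > idle_timeout)


# Hypothetical pool of 20: 5 connections carried a trickle of traffic in the
# last 30 seconds, while 15 have been untouched for 10 minutes.
idle_ages = [30.0] * 5 + [600.0] * 15
stale = count_stale(idle_ages, idle_timeout=350.0)
print(f"{stale} of {len(idle_ages)} connections are stale")
```

The pool itself cannot see this number, because nothing on the client side changes when the middlebox forgets a flow. That is why the count only becomes visible once requests start hanging.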

The burst of requests borrows stale connections. Each borrowed stale connection blocks its thread. The pool drains to the surviving good connections (maybe 5), and those 5 have to serve the full load while 15 threads are stuck in TCP retransmit. If every connection sat idle past the timeout, there are no good connections left at all and everything hangs.

This is also exactly what happens on post-deploy restart when the pool is fresh but the first query pattern is bursty.

Three Approaches, Ranked

1. Validate on borrow. Before handing a connection to a thread, verify it's alive. HikariCP does this with JDBC4's Connection.isValid(). PgBouncer runs server_check_query (select 1 by default) before reusing a server connection that has been idle longer than server_check_delay. This adds one round-trip per borrow, typically under 1ms on a local network, but guarantees the connection was live immediately before the application query runs. This is the most reliable approach.
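
A minimal sketch of validate-on-borrow, assuming a hypothetical fixed-size pool. ValidatingPool, FakeConnection, and the is_alive callback are invented for illustration and are not the API of HikariCP, PgBouncer, or any other library.

```python
import queue
from typing import Callable, Optional


class FakeConnection:
    """Stand-in for a real driver connection, for demonstration only."""

    def __init__(self) -> None:
        self.alive = True
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ValidatingPool:
    """Fixed-size pool that checks each connection before lending it out."""

    def __init__(self, factory: Callable[[], FakeConnection],
                 is_alive: Callable[[FakeConnection], bool],
                 size: int) -> None:
        self._factory = factory
        self._is_alive = is_alive
        self._idle: "queue.Queue[FakeConnection]" = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    def borrow(self, timeout: Optional[float] = None) -> FakeConnection:
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        if not self._is_alive(conn):
            conn.close()            # discard the stale connection
            conn = self._factory()  # and hand out a fresh replacement
        return conn

    def give_back(self, conn: FakeConnection) -> None:
        self._idle.put(conn)


pool = ValidatingPool(FakeConnection, lambda c: c.alive, size=1)
first = pool.borrow()
first.alive = False        # simulate a middlebox silently dropping the flow
pool.give_back(first)
second = pool.borrow()     # validation fails, so a new connection is created
print(second is not first, first.closed)
```

In a real pool the liveness probe needs its own short timeout, otherwise the check can hang on the dead socket just as the query would. HikariCP exposes validationTimeout for this purpose.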

2. Background eviction. Periodically close connections that have sat idle, or simply been open, past a threshold: HikariCP's idleTimeout and maxLifetime; PgBouncer's server_idle_timeout and server_lifetime. Set the thresholds below the shortest idle timeout on the network path. This avoids paying a validation cost on every borrow, but there's always a window between the last eviction check and the next borrow where a connection can go stale. Works well as a complement to validate-on-borrow, not a replacement.
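
The eviction pass can be sketched as a pure function over the idle list. IdleEntry and split_expired are hypothetical names, and the 300-second idle threshold is an assumed value chosen to sit below the 350-second NLB timeout mentioned earlier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IdleEntry:
    name: str
    created_at: float     # seconds on a monotonic clock
    last_used_at: float   # seconds on the same clock


def split_expired(idle: List[IdleEntry], now: float,
                  idle_timeout: float,
                  max_lifetime: float) -> Tuple[List[IdleEntry], List[IdleEntry]]:
    """Return (keep, evict) for one eviction pass over idle connections."""
    keep: List[IdleEntry] = []
    evict: List[IdleEntry] = []
    for entry in idle:
        too_idle = now - entry.last_used_at >= idle_timeout
        too_old = now - entry.created_at >= max_lifetime
        (evict if too_idle or too_old else keep).append(entry)
    return keep, evict


entries = [
    IdleEntry("a", created_at=900.0, last_used_at=990.0),  # young and recently used
    IdleEntry("b", created_at=500.0, last_used_at=600.0),  # idle for 400 s
    IdleEntry("c", created_at=50.0, last_used_at=995.0),   # busy, but 950 s old
]
keep, evict = split_expired(entries, now=1000.0,
                            idle_timeout=300.0, max_lifetime=900.0)
print([e.name for e in keep], [e.name for e in evict])
```

A real pool would run a pass like this from a timer thread, close everything in the evict list, and open replacements to keep the pool at its minimum size.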

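3. TCP keepalive tuning. Set tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes at the OS level, or the equivalent options per-socket. Linux defaults: keepalive probes start after 7200 seconds of idle, which is far too long. Setting tcp_keepalive_time=60 with tcp_keepalive_intvl=10 and tcp_keepalive_probes=5 means a dead idle connection is detected within about 110 seconds (60 + 5 × 10). As long as probes fire more often than the shortest idle timeout on the path, they also keep the flow from being dropped in the first place. This operates at the transport layer, below your pool, and can also catch a connection that dies while a borrowing thread is waiting on a response. But it's coarse: the sysctls only set defaults for sockets that have SO_KEEPALIVE enabled, so your connection library must turn keepalive on or expose per-socket configuration.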
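
For the per-socket route, a rough sketch using Python's standard socket module. The helper names are illustrative. TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are the Linux option names and may be missing on other platforms, so the code only sets the ones the running build exposes.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, probes: int = 5) -> None:
    """Turn on keepalive for one socket and override the system-wide timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux option names; skip any that this platform does not define.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def worst_case_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Idle wait before the first probe, plus one interval per failed probe."""
    return idle + interval * probes


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    enable_keepalive(sock)
    enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
finally:
    sock.close()
print(enabled, worst_case_detection_seconds(60, 10, 5))
```

Most database drivers open the socket themselves, so in practice these values are usually passed through driver settings rather than set by hand. libpq, for example, accepts keepalives_idle, keepalives_interval, and keepalives_count as connection parameters.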

Use all three. They're not alternatives.

What Health Checks Don't Catch

A healthy connection doesn't mean your queries will succeed. Health checks verify the TCP connection and database server reachability. They don't verify:

Server-side resource pressure — the connection is alive but queries queue behind a long-running transaction holding locks
Replication lag — the connection is healthy but you're reading from a replica that's 30 seconds behind
Schema state — the connection is healthy but you're running against a schema version that doesn't match your application code post-migration

These require application-level checks, not pool-level checks. The pool health check answers one question: is this TCP connection live, and is the database server accepting queries? Everything above that layer is your problem.
