The symptoms are consistent: a connection that has been idle for some period of time fails on the next query with a broken pipe, connection reset, or EOF. The application reconnects and the query succeeds. Intermittent, impossible to reproduce under load, always happens after an idle period. You assume it's a fluke and move on.
It's not a fluke. It's infrastructure with a timeout shorter than your idle period, and at least three different systems may have set that timeout without asking you.
TCP keepalive defaults
TCP connections don't know if the other end is still there unless someone sends a packet. If a connection is idle long enough, an intermediate device — a load balancer, a NAT gateway, a firewall — will silently drop it. TCP keepalives are the mechanism for preventing this: periodic empty packets that keep the connection alive without application-level activity.
Linux's default keepalive settings:
- tcp_keepalive_time: 7200 seconds (2 hours) before the first keepalive probe
- tcp_keepalive_intvl: 75 seconds between probes
- tcp_keepalive_probes: 9 probes before declaring the connection dead
Two hours before the first probe. This is fine for connections between servers in the same data center. It is not fine when there's a load balancer in the path with a 60-second idle timeout.
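To put numbers on that gap, here is a small arithmetic sketch. The function name and the dictionary are illustrative only; the values are the Linux defaults listed above and the 60-second load balancer timeout from the example.

```python
def seconds_until_declared_dead(idle_s: int, interval_s: int, probes: int) -> int:
    """Worst-case time for the kernel to give up on a peer that never answers."""
    # The first probe goes out after idle_s of silence. Each unanswered probe
    # is followed by an interval_s wait, and after `probes` unanswered probes
    # the connection is reported dead.
    return idle_s + interval_s * probes


LINUX_DEFAULTS = {"idle_s": 7200, "interval_s": 75, "probes": 9}

first_probe = LINUX_DEFAULTS["idle_s"]
detection = seconds_until_declared_dead(**LINUX_DEFAULTS)
lb_idle_timeout = 60  # the load balancer timeout used in the example above

print(first_probe, detection)          # 7200 7875
print(first_probe // lb_idle_timeout)  # 120
```

With the defaults, the first probe is sent 120 load-balancer timeouts too late, and a silent peer is only declared dead after 7875 seconds.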
Load balancer idle timeouts
AWS ALB default idle timeout: 60 seconds. A connection idle for more than 60 seconds is silently dropped at the load balancer. The application doesn't know. The database doesn't know. The next query from the application goes into the void.
AWS NLB default idle timeout: 350 seconds for TCP connections. Better, but still shorter than Linux's 2-hour keepalive default.
Other load balancers have their own defaults. GCP's load balancer defaults to 600 seconds. A typical HAProxy configuration sets 50 seconds for client connections. Each sits far below Linux's default keepalive idle time of two hours, which means that by default the first keepalive probe is sent long after the load balancer has already dropped the connection.
PgBouncer
If you're using PgBouncer, add a third timeout to the list. PgBouncer's server_idle_timeout defaults to 600 seconds — connections to the actual Postgres server that have been idle for 10 minutes are closed. The application's connection to PgBouncer may remain open; the underlying server connection does not.
This creates a situation where the application believes it has a live connection while PgBouncer has already closed the server-side connection. Usually the next query through PgBouncer gets a fresh server connection and the application sees nothing unusual. Occasionally the timing is unlucky, the closure lands mid-query, and the application sees an error.
The full chain
A typical path from application to database might traverse:
- Application (no idle timeout configured)
- Connection pool (idle timeout: varies, often 10-30 minutes)
- PgBouncer (server_idle_timeout: 600s default)
- AWS ALB (idle timeout: 60s default)
- Postgres server (no idle timeout on active connections by default)
The shortest timeout wins. With default ALB settings, any connection idle for 60 seconds is a broken connection waiting to be discovered.
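The "shortest timeout wins" rule can be written down as a quick check. This is a sketch, not a discovery tool: the hop names and helper functions are made up for illustration, the pool value is a placeholder for "10-30 minutes", and the 0.5 safety factor is an assumption that happens to yield the 30-second figure used later.

```python
from typing import Dict, Optional, Tuple

# Idle timeouts in seconds for each hop; None means no idle timeout is set.
chain: Dict[str, Optional[int]] = {
    "application": None,
    "connection pool": 600,  # placeholder for "10-30 minutes"
    "pgbouncer server_idle_timeout": 600,
    "aws alb": 60,
    "postgres": None,
}


def tightest_hop(timeouts: Dict[str, Optional[int]]) -> Optional[Tuple[str, int]]:
    """Return the hop with the shortest finite idle timeout, if any."""
    finite = {name: t for name, t in timeouts.items() if t is not None}
    if not finite:
        return None
    name = min(finite, key=lambda n: finite[n])
    return name, finite[name]


def keepalive_idle_for(limit_s: int, safety_factor: float = 0.5) -> int:
    """Pick a keepalive idle time comfortably below the tightest timeout."""
    return max(1, int(limit_s * safety_factor))


hop = tightest_hop(chain)
print(hop)                         # ('aws alb', 60)
print(keepalive_idle_for(hop[1]))  # 30
```

Filling in the real values for your own path is the useful part; the arithmetic is trivial once the numbers are known.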
The fix
Set TCP keepalive at the application level, below the lowest timeout in your infrastructure chain. For a stack with an ALB, keepalive should fire well under 60 seconds — 30 seconds is a reasonable starting point.
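At the socket level, these settings are per-connection options that override the system-wide defaults. The sketch below uses Python's standard socket module; the helper name enable_keepalive is hypothetical, and the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT constants are the Linux names, which are not available on every platform, hence the hasattr guards.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 30,
                     interval_s: int = 10, probes: int = 3) -> None:
    """Turn on TCP keepalive and, where supported, tune its timing."""
    # SO_KEEPALIVE switches the mechanism on for this socket only.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform-specific, so each one is set only
    # if the running platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    enable_keepalive(sock)
    enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    print(enabled)  # True
```

Most database drivers do not hand you the raw socket, so in practice the same three numbers are passed through driver parameters instead.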
In psycopg2:
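import psycopg2

# These are libpq keepalive parameters, passed through by psycopg2.connect().
conn = psycopg2.connect(
    dsn,
    keepalives=1,            # enable TCP keepalive on the client socket
    keepalives_idle=30,      # seconds of idle time before the first probe
    keepalives_interval=10,  # seconds between unanswered probes
    keepalives_count=3,      # unanswered probes before the connection is dead
)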
In SQLAlchemy:
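from sqlalchemy import create_engine

# connect_args is forwarded to the underlying driver's connect() call.
engine = create_engine(
    dsn,
    connect_args={
        "keepalives": 1,
        "keepalives_idle": 30,
        "keepalives_interval": 10,
        "keepalives_count": 3,
    },
)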
In Node.js with pg:
const { Pool } = require("pg");

const pool = new Pool({
  connectionString: dsn,
  connectionTimeoutMillis: 5000, // give up on a new connection after 5 seconds
  idleTimeoutMillis: 25000, // expire idle connections before ALB does
});
The Node.js pg pool's idleTimeoutMillis is actually the better fix for that stack — it removes idle connections from the pool proactively before the load balancer closes them.
Application-level ping vs TCP keepalive
Some connection pools support a "test on borrow" query — a SELECT 1 or equivalent before returning a connection from the pool. This works but adds latency to every connection acquisition. TCP keepalive is better when the infrastructure chain allows it: it keeps the connection alive continuously rather than testing it at checkout time.
The two mechanisms solve slightly different problems. Keepalive prevents silent closure by intermediate infrastructure. Test-on-borrow detects closed connections at checkout. For robust handling you often want both: keepalive to prevent the problem, and reconnection logic to handle cases where keepalive wasn't enough.
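The reconnection half can be as small as a retry-once wrapper. What follows is a minimal sketch with hypothetical names (ReconnectingClient, FakeConn), not any particular pool's API. It assumes a dead connection surfaces as ConnectionError or EOFError; real drivers raise their own exception types, so the tuple would need adjusting.

```python
from typing import Any, Callable

# Assumption for this sketch: a dead connection shows up as one of these.
# BrokenPipeError and ConnectionResetError are subclasses of ConnectionError.
DEAD_CONNECTION_ERRORS = (ConnectionError, EOFError)


class ReconnectingClient:
    """Runs an operation and retries it once on a fresh connection."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        self._connect = connect
        self._conn: Any = None

    def run(self, operation: Callable[[Any], Any]) -> Any:
        if self._conn is None:
            self._conn = self._connect()
        try:
            return operation(self._conn)
        except DEAD_CONNECTION_ERRORS:
            # The connection was dropped while idle: replace it and retry
            # once. A second failure propagates to the caller.
            self._conn = self._connect()
            return operation(self._conn)


# Demo with stand-in connections: the first is stale, the second is healthy.
class FakeConn:
    def __init__(self, alive: bool) -> None:
        self.alive = alive

    def query(self, sql: str) -> str:
        if not self.alive:
            raise BrokenPipeError("broken pipe")
        return "ok"


conns = iter([FakeConn(alive=False), FakeConn(alive=True)])
client = ReconnectingClient(lambda: next(conns))
result = client.run(lambda c: c.query("SELECT 1"))
print(result)  # ok
```

A blind retry like this is only safe for operations that can be repeated, such as reads or idempotent writes outside an open transaction.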
The timeout you never set was set by someone else. Find out what it is before your next incident.