Database Query Timeouts: The Single Knob Most Production Systems Get Wrong

Statement timeouts protect databases from runaway queries that turn slow into catastrophic. Most teams set them once, set them too high, and discover the consequences during an incident. The patterns that work, the misconfigurations that compound, and how to set timeouts that actually corr

A database query timeout limits how long a single statement can run before the database aborts it. Set too high, a single slow query can hold locks long enough to cascade through every connection in the pool. Set too low, legitimate work gets interrupted and the application logs fill with timeout exceptions. Most teams never adjust the default and discover during their first major incident that the default was wrong for their workload.

We've spent enough time with timeouts across DocuMint, CronPing, FlagBit, and WebhookVault to know that the relevant knob is not "what is the global timeout" but "what is the timeout for this specific workload."

The four timeouts that matter

Every database has more timeout knobs than people realize. The four that materially affect production behavior:

Statement timeout kills queries that run too long. In PostgreSQL this is statement_timeout; in MySQL it is max_execution_time for SELECT and various other knobs for writes; in SQLite it is the busy timeout plus application-layer enforcement. This is the knob that most directly prevents a single query from holding resources indefinitely.

Lock timeout kills queries that wait too long for a lock. PostgreSQL's lock_timeout is separate from statement_timeout: a query can be patient about its own work but impatient about waiting for someone else's. This matters because lock waits are usually the wrong reason to be slow; the application would rather fail fast and retry than block.

Idle-in-transaction timeout kills connections that have started a transaction and then gone idle. PostgreSQL's idle_in_transaction_session_timeout is the safety net for application bugs that fail to commit or roll back. Without this, a connection in a failed code path can hold a transaction open for hours, blocking VACUUM and other maintenance.

Connection idle timeout closes connections that have been idle too long. This is mostly the connection pool's job, but the database may have its own. The value matters less for correctness than for resource accounting; a misconfiguration here surfaces as occasional connection storms on cold starts.

Why the defaults are wrong

The PostgreSQL default for all three timeouts is zero, meaning unlimited. This is correct for the general-purpose database server that needs to support unknown workloads but is wrong for any production application. The application knows things the database does not: that no analytical query needs more than 30 seconds, that no user-facing endpoint should hold a lock for more than 100 milliseconds, that a transaction that has been idle for 10 seconds is almost certainly the result of an application bug.

The right approach is to set the timeouts at the session level for each class of workload, not at the server level. A reporting query needs a different timeout than a user-facing read. A background batch job needs a different timeout than a webhook handler. Setting one global value means picking the maximum of all workload requirements, which means losing the protection the timeout was supposed to provide for the smaller workloads.

Per-workload timeout patterns

Setting per-workload timeouts requires that the application know what workload class it is in. We use four conventions, applied via SET LOCAL at the start of the transaction:

User-facing reads: statement_timeout of 5 seconds, lock_timeout of 1 second. The user is waiting for a response; if the query can't complete in 5 seconds, the right answer is to return an error so the application can render a degraded experience rather than a hung page.

User-facing writes: statement_timeout of 30 seconds, lock_timeout of 5 seconds. Writes sometimes legitimately need more time (a complex INSERT with cascading triggers, for example) but should not wait long for locks.

Background jobs: statement_timeout of 5 minutes, lock_timeout of 30 seconds. Background work can afford more patience but should still have an upper bound. The 5 minute value comes from our largest legitimate background query plus margin.

Reporting and analytics: statement_timeout of 30 minutes, lock_timeout of 1 minute. Reports run on a replica where lock contention is rare and long-running queries are expected. The replica's WAL replay can be paused if needed, so the long timeout is acceptable.

The retry trap

The most common mistake with timeouts is to set them aggressively and then add automatic retries. The cure is worse than the disease: a query that always takes 10 seconds and times out at 5 seconds will, with three retries, take 15 seconds of database time plus the latency of the failed attempts. The database does more work, the customer waits longer, and the load pattern looks like a thundering herd whenever the workload spikes.

The correct response to a timeout is usually to fail the operation and surface a meaningful error to the customer or the calling system. Retries belong only for known-transient failures (a serialization conflict, a deadlock victim, a brief lock wait) and need exponential backoff plus a strict retry budget.

What the operator should monitor

Timeouts are a feedback mechanism: they tell you that something is taking longer than you expected. The four signals that tell you whether the timeouts are calibrated:

Rate of statement_timeout errors per workload class. A baseline rate is normal and useful (it tells you the timeout is doing its job). A sudden rate increase signals that some queries have started to take longer than they used to, which is almost always a missing index or a runaway data growth.

P99 latency by workload class. If the P99 is approaching the timeout, the timeout is about to start firing routinely. Raise the timeout temporarily, fix the underlying slowness, then lower it back.

idle_in_transaction count and oldest age. A nonzero count is normal; an old age is an application bug.

Lock wait time by query. PostgreSQL exposes this in pg_locks; a few high-wait queries usually point to a missing index or a hot row.

The deeper observation

Timeouts are one of the few places where the right configuration depends on the application's understanding of its own behavior. The database cannot know that 30 seconds is too long for a user-facing read or too short for a nightly report; only the application can. Treating timeouts as a single knob to set once and forget is the source of most of the surprises that follow. Treating them as a per-workload contract — and reviewing the contract whenever the workload changes — is the discipline that prevents the contract from breaking silently in production.

Read more