Database Connection Limits: The Hidden Cost of Every New Feature

The first time a team hits the database connection limit it usually feels like a mystery. The application has been running fine for months. A small feature ships, traffic increases slightly, and then suddenly half the requests start timing out with "too many connections" errors that point at the database. Restarting the application fixes it temporarily. Restarting again fixes it less well. By the third restart the team is in incident mode and the post-mortem starts drafting itself.

The mystery dissolves once you know how connection accounting actually works. PostgreSQL ships with max_connections=100 by default. Every connection consumes about 10MB of RAM on the server, regardless of whether it is doing useful work. Every framework's connection pool is configured to hold open some number of connections, and the default is often higher than the dedicated user thinks. Every new worker process or serverless invocation opens its own pool. The math is additive and unforgiving.

The connection budget

The discipline that prevents this class of incident is to maintain an explicit connection budget. The budget is a single number: the database's max_connections, minus reserved slots for superuser and monitoring. Every component that connects to the database is allocated a portion of that budget. The sum of all allocations must be less than the budget, with headroom for restarts.

The rough allocation looks like this for a typical application. The web tier gets the largest share, allocated as (number of web workers) times (pool size per worker). The background job workers get their own allocation. Migrations, cron jobs, and admin tools each get a small dedicated allowance. Monitoring agents (Datadog, New Relic, custom dashboards) often consume more connections than teams realize, especially if they run query-level instrumentation.

The trap with serverless and autoscaled workloads is that the worker count is unbounded. Every Lambda invocation that opens its own connection during a traffic burst contributes to the count. The fix is either to use a connection pooler in front of the database (PgBouncer in transaction mode, or RDS Proxy) or to use a serverless-aware pool that aggressively closes connections after each invocation. The naive approach of opening a per-invocation connection works at low traffic and fails catastrophically when traffic spikes.

The connection storm pattern

The most common failure mode is not gradual saturation but sudden storm. The pattern is: a brief disruption causes the application tier to lose its existing connections, the application immediately tries to reconnect, the reconnect attempts arrive at the database faster than the database can accept them, the database stops responding to new connections, the application interprets that as more failures and retries, and the system enters a feedback loop where connection attempts pile up faster than they can be processed.

The triggering event can be small. A network hiccup that lasts 200ms is enough. A database failover that briefly drops connections is enough. A deploy that restarts the application tier is enough if the application opens connections eagerly on startup. The fix has two parts. First, connection establishment must be lazy: a worker should not open its connections at startup but only on the first request that needs one. This spreads the connection load over the first few seconds of traffic. Second, reconnects must include jitter: the application should retry with a randomized delay so that all workers do not retry at the same instant.

The deeper fix is to put a connection pooler in front of the database, so that the application can have many short-lived connections to the pooler while the pooler holds a small number of long-lived connections to the database. PgBouncer in transaction-pooling mode is the standard choice for PostgreSQL; it can absorb a thousand application connections and present ten or twenty backend connections to the database. The trade-off is that some PostgreSQL features (session-scoped state, prepared statements that survive across queries, advisory locks) do not work in transaction-pooling mode and require either statement mode or a session-pooling tier.

Idle connections are not free

The intuition that an idle connection costs nothing is wrong. Every idle connection holds memory on the database server (between 5MB and 20MB depending on configuration and recent query history). Every idle connection counts against max_connections. Every idle connection in PostgreSQL holds a backend process; on systems with many idle connections, the process count alone can degrade scheduling.

The configuration that matters is idle_timeout in your application's pool, and idle_in_transaction_session_timeout on the database side. The application's idle timeout should be short enough that connections close before the network's NAT/load-balancer kills them silently (typically 30 to 60 seconds for cloud environments). The database's transaction-idle timeout is a safety net for the case where an application opens a transaction, never commits or rolls back, and walks away. Without it, that connection can hold locks and block other writers indefinitely.

The max_lifetime parameter

The pool parameter that most teams overlook is max_lifetime, which forces a connection to be closed and recreated after some duration regardless of activity. This matters because long-lived connections accumulate state: session-level configuration, prepared statement caches, role memberships, and various bits of memory that PostgreSQL allocates on a per-session basis. A connection that has been open for a week has a significantly larger memory footprint than a fresh connection, even if it appears idle.

The other reason for max_lifetime is rolling deploys of the database tier. A connection opened against database server v1.2.0 will continue to use v1.2.0's wire protocol even after the server is upgraded to v1.3.0; in most cases this is fine, but in some cases there are subtle differences in behavior. Periodically recycling connections forces the application to renegotiate against the current database version, which catches compatibility issues earlier.

A reasonable max_lifetime is one to four hours for most applications. Setting it shorter than the idle_timeout is wasteful; setting it longer than a day means stale state accumulates.

Monitoring the right metrics

The metrics that predict connection-related incidents are not the ones most teams have on dashboards. Total connection count is necessary but not sufficient. The metric that matters most is the proportion of connections in the "idle in transaction" state, which indicates application bugs that hold transactions open longer than necessary. The next is the pool wait time, measured at the application's pool layer: the time a request spends waiting for a connection to become available. Pool wait time should be near zero in healthy operation; a sustained nonzero value means the pool is undersized for current load.

The third metric is connection creation rate, plotted alongside the active worker count. A spike in connection creation that is not accompanied by a spike in workers usually indicates a bug where the application is leaking connections (opening them and never returning them to the pool). This is a slow-burn pattern that can take weeks to develop and then suddenly manifests as exhaustion.

The four APIs we run at DocuMint, CronPing, FlagBit, and WebhookVault all use SQLite, which does not have a max_connections problem in the PostgreSQL sense. But the deeper lesson generalizes to any shared resource: the connection pool, the file descriptor count, the disk inode budget. Every resource has a limit, and the discipline of explicit budgeting is what prevents discovery during an incident.