Postgres auto_explain: Surfacing Slow Queries Without Application Instrumentation

Most slow-query investigations begin with reconstructing what happened from incomplete information. The slow query showed up in a metric or a customer report. The dashboard logs say the request took longer than usual. The application code does not log which specific SQL was generated by the ORM or which parameters were bound. The database's pg_stat_statements view shows that one normalized query is consuming disproportionate time but does not show the plan for any particular execution. The investigator runs EXPLAIN ANALYZE on a representative case and either reproduces the slowness or does not, often does not, because the production conditions involved a parameter the investigator did not guess or a cache state that no longer exists.

The auto_explain extension closes this gap. It logs the actual execution plan of every query that exceeded a configurable threshold. The information captured is the plan, the parameter values, the buffer counters, and the timing. The investigator working a slow-query report two hours later has the actual evidence rather than a reconstruction.

What auto_explain does

The extension hooks into the Postgres query executor. When a query completes, if its execution time exceeded the configured threshold, the extension generates the equivalent of EXPLAIN (ANALYZE, BUFFERS) output and writes it to the server log. The hook runs after execution completes, so the plan reflects what actually happened including chosen indexes, row counts encountered, and time spent in each node.

The extension is bundled with Postgres but not loaded by default. Loading requires shared_preload_libraries = 'auto_explain' in postgresql.conf and a server restart. Once loaded, the extension exposes configuration parameters that can be changed at runtime via ALTER SYSTEM without restart.

The configuration knobs that matter

The threshold parameter auto_explain.log_min_duration controls which queries get logged. Setting it to 1000 means queries taking longer than one second. The right threshold depends on the application's typical query distribution. A common starting point is the p95 expected duration, which catches outliers without flooding logs with normal-but-slow background queries.

The auto_explain.log_analyze parameter when set to on includes actual execution time and row counts. Without it, the log shows only the planner's estimates. The actual numbers are the diagnostic value of the extension; the estimates are already available via plain EXPLAIN.

The auto_explain.log_buffers parameter when set to on includes buffer hit/read counters in the log. These are essential for diagnosing whether a slow query was slow because it did unexpected disk IO or because it executed in CPU on cached data. The two cases have different remediations.

The auto_explain.log_timing parameter when set to on includes per-node timing. This requires high-resolution timer calls per node and can add 10-20 percent overhead. Most production deployments leave timing off and accept node-level totals.

The auto_explain.log_nested_statements parameter when set to on captures plans for queries inside functions and procedures. Without it, only the outer call is logged.

The auto_explain.sample_rate parameter introduced in Postgres 13 allows sampling rather than logging every qualifying query. Setting it to 0.1 captures one in ten qualifying queries. This is the right pattern when the slow-query threshold is low enough that logging every execution would overwhelm log volume.

The overhead question

The overhead of auto_explain has three components. The first is the threshold check itself, which is essentially free. The second is the plan-formatting cost on qualifying queries, which is small but proportional to plan complexity. The third is the timer calls if log_timing is on, which can be substantial on systems with slow clock-source resolution.

On modern Linux with TSC clock source, timer calls take tens of nanoseconds and the overall overhead is under one percent for typical workloads. On systems with HPET or other slow clock sources, the overhead can be five to ten percent and timing should be disabled. The clock_gettime performance can be tested independently before deciding.

The log volume question

Every qualifying query produces a multi-line log entry of typically 1-5KB. At one slow query per second the volume is reasonable. At one hundred per second the volume is problematic. The mitigation options are raising the threshold, enabling sampling, or shipping logs to a backend with structured-log search capability.

The structured-search backend is the right answer for any team that runs auto_explain in production. Grep-friendly log files become unwieldy quickly when each entry is dozens of lines of indented plan output. Tools that parse Postgres log format and index plan structure pay back the integration cost within weeks.

What auto_explain does not capture

The extension captures plans for queries the database executed. It does not capture queries that timed out at the client, queries that failed before execution, queries that were cancelled, or transactions that were rolled back without completing the slow operation. The slow-query symptoms in those cases require different tooling.

The extension also does not capture lock waits or connection waits. A query that took thirty seconds because it waited twenty-nine seconds for a row lock shows up with execution time thirty seconds but the plan does not explain the wait. The wait diagnosis requires pg_stat_activity sampling or pg_locks investigation at the time of the wait.

The patterns that compound value

Running auto_explain continuously in production with a low threshold and sampling gives the team a steady supply of evidence about what actually runs slowly. The diagnostic pattern shifts from "guess what was slow" to "search the log for the request_id and read the plan."

Combining auto_explain output with application-side trace IDs in the log makes the join between application observability and database evidence straightforward. The right pattern is logging the trace ID as a Postgres session-level application_name or using SET application_name at the start of each request, which propagates to the auto_explain log entries.

Running auto_explain in staging during load tests reveals plans that production has not exhibited but might. The pattern catches index-not-used cases driven by query rewrites or statistics changes before they reach customers.

Our use

Our four products run on SQLite where auto_explain has no direct analog. The closest equivalent is SQLite's EXPLAIN QUERY PLAN combined with application-level slow-query logging in our shared FastAPI middleware. The middleware records request duration and the executed SQL for any request exceeding 500ms, which captures the same information for queries that complete and the slow execution pattern.

The Postgres migration plan for any of the four products includes auto_explain configuration as launch infrastructure rather than later addition. The threshold starts at 1000ms with buffer logging enabled and timing disabled and sample_rate 1.0 because the four products are individually small enough that volume is not yet a concern. The threshold will lower over time as we learn the typical query duration distribution.

The deeper observation

The deeper observation is that database observability is consistently underused relative to its operational value. The features exist, are stable, and have small overhead, but require configuration discipline and log-pipeline investment that teams defer until an incident makes the investment urgent. The pattern across pg_stat_statements and pg_stat_activity and auto_explain is that turning these on at launch pays back the first time a slow query needs investigation and the team has the evidence rather than the reconstruction.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.