Database Statistics and the Query Planner: Why ANALYZE Matters More Than You Think
The query planner doesn't run your query — it estimates which plan would be fastest based on statistics about your data. When the statistics drift from reality, the planner picks plans that look fast on paper and run terribly in practice.
The thing nobody tells you about query planners is that they don't really know what's in your tables. They guess, based on statistics that were collected at some point in the past, possibly under load conditions that no longer apply. The estimates are usually close enough that the right plan gets picked. But when the statistics drift far enough from reality — because a bulk load just changed the row count, because a column's value distribution shifted, because the table grew faster than autovacuum could keep up with — the planner picks plans that look fast on paper and run terribly in practice. Understanding what statistics the planner uses, when they get refreshed, and what it takes to keep them honest is one of those database-engineering disciplines that pays for itself the first time it saves a production incident.
The patterns in this post apply across the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — and to any service backed by a relational database that uses cost-based query planning, which is essentially all of them.
What the planner is actually doing
When you submit a query, the planner enumerates a set of candidate execution plans and estimates the cost of each one. The cost is a unitless number that represents the planner's guess at how much work the plan will do — combining estimates of CPU time, page reads, sort operations, and join methods. The planner picks the plan with the lowest estimated cost and hands it to the executor, which runs it.
The cost estimate depends on row counts. To estimate how many rows a WHERE clause will return, the planner looks at the column's statistics: the number of distinct values, the most common values and their frequencies, the distribution histogram, the proportion of nulls. To estimate how many rows a join will produce, the planner looks at the join keys' statistics on both sides. To decide whether an index scan beats a sequential scan, the planner estimates how many rows the index would return and compares that to the cost of reading the whole table.
The estimates are usually approximate. The histograms have limited resolution. The most-common-values list captures the top hundred or so. Correlations between columns are mostly invisible to the planner unless extended statistics are explicitly created. The planner assumes uniform distribution within histogram buckets, which is usually wrong but usually wrong by a small enough margin that it doesn't matter.
How statistics get collected
PostgreSQL collects statistics through autovacuum's analyze pass, which is triggered when a table accumulates more changes than a configurable threshold — by default, 10% of the row count plus 50 rows. The default sample size is 30,000 rows times the per-column statistics target, which defaults to 100. For a table with billions of rows, the sample is a tiny fraction of the data, but the law of large numbers means the histograms are usually accurate enough.
The autovacuum analyze pass runs in the background and uses a small fraction of the database's resources. For most tables, this is enough to keep statistics current. The cases where it's not enough — where statistics drift far enough from reality to cause planner regressions — are the cases worth knowing about.
The first case is bulk loads. When you insert a million rows into a previously empty table, the statistics still show zero rows until autovacuum gets around to analyzing it. In the meantime, the planner thinks the table is empty and picks plans suited to that, including nested loop joins that would be catastrophic on the actual data. The fix is to run ANALYZE manually after large data loads.
The second case is recently-created indexes. When you create an index on a large table, the planner doesn't immediately know the index's cardinality and selectivity. The fix is to run ANALYZE on the table after the index is created.
The third case is correlated columns. The planner assumes columns are independent unless told otherwise. If you have a query that filters on city='New York' AND state='NY', the planner multiplies the selectivities, even though the second filter selects almost none of the rows that the first didn't already select. The fix is CREATE STATISTICS to teach the planner about the correlation.
The fourth case is data with skew that exceeds the most-common-values list. If 99% of your rows have one tenant_id and the remaining 1% are distributed across thousands of others, the planner sees the dominant tenant and picks plans optimized for it, which are wrong for the long tail. The fix is to increase the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.
How to spot bad estimates
The single most useful diagnostic is EXPLAIN ANALYZE, which runs the query and reports both the estimated row count and the actual row count for each plan node. When the estimate and the actual diverge by more than an order of magnitude, the planner is working with bad information. The plan node that diverges most is usually the one to investigate.
The auto_explain extension can log the plans of any slow query automatically, which is the highest-leverage configuration change for a database that's running production traffic. Set auto_explain.log_min_duration to a few seconds and auto_explain.log_analyze to true, and slow queries will appear in the log with their full plan, including the estimate-vs-actual divergence.
The pg_stat_user_tables view shows the last_analyze and last_autoanalyze timestamps for each table. Tables that haven't been analyzed in days under continuous write load are candidates for explicit ANALYZE.
The pg_statistic_ext_data view, for tables with extended statistics defined, shows what correlations the planner is aware of. Tables with skewed correlated columns and no extended statistics are silent contributors to bad plans.
What to actually do
The discipline that holds up is to treat statistics as a thing the application is responsible for, not just something the database does in the background. After bulk loads, run ANALYZE. After creating an index, run ANALYZE. When you write a query that filters on multiple correlated columns, define extended statistics. When you have a table with extreme skew, increase the statistics target on the affected columns.
The diagnostic discipline is to enable auto_explain in production with a sensible threshold, monitor for plans where estimated and actual rows diverge by more than ten times, and treat those divergences as a signal that some statistics need attention. Over time, this discipline produces a database that picks good plans most of the time, which is the goal.
The deeper observation
Cost-based query planning is one of the great engineering achievements of the database community. The same SQL query against the same data can run in microseconds or in hours depending on the plan; the planner's job is to make the difference invisible to the developer. The price of the abstraction is that the developer has to maintain the conditions under which the planner works well, which mostly comes down to keeping the statistics current. The teams that take statistics seriously have predictable database performance; the teams that don't have a recurring genre of incident where a query that ran fine yesterday now takes ten minutes, and the fix turns out to be ANALYZE.