engineering

Postgres CLUSTER and Physical Row Order: When Storage Layout Affects Query Performance

Postgres tables are heap-organized: rows live wherever there is free space. For most queries this does not matter. For range scans on a correlated column, physical order matters enormously, and CLUSTER is the tool that fixes it.

Anethoth

17 May 2026 — 5 min read

Postgres tables are heap-organized. Rows live wherever the system found free space when they were inserted, with no particular relationship to any sort order. This is a deliberate choice: it makes inserts fast, it avoids the complexity of clustered indexes, and for most workloads it does not matter because indexes provide the ordering the application needs.

The exception is range scans over a correlated column. When a query says WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31', the planner can use an index on created_at to find the matching rows, but the rows themselves are scattered across thousands of disk pages. Each row that matches requires a separate page read. If the column is well-correlated with physical order, those reads cluster onto a small number of pages. If it is not, the read pattern looks random.

The pg_stats.correlation column reports this directly. A value near 1.0 means the column is monotonically aligned with physical order, like an auto-increment ID on a never-updated table. A value near 0 means scattered. For a range query on a column with correlation 0.1, reading a million matching rows might require reading a million pages. For the same query on a correlation 0.95 column, those same million rows might fit in a few thousand sequential pages.

What CLUSTER actually does

The CLUSTER command physically reorders a table according to an index. The table is rewritten on disk, row by row, in index order. After CLUSTER, the correlation for the indexed column is roughly 1.0, range queries on that column become sequential scans of a few pages, and BRIN indexes on that column become useful.

The cost is that CLUSTER takes an ACCESS EXCLUSIVE lock on the table for the duration of the rewrite. The table is unavailable for reads and writes throughout. For a 50 GB table, this is a multi-minute outage. The post-CLUSTER state is not maintained: new inserts go wherever there is free space, and the correlation degrades over time. CLUSTER must be re-run periodically to maintain the ordering.

The disk overhead is roughly 2x table size during the operation. The new table is built alongside the old one and then the old one is dropped. WAL volume is proportional to the entire table size, since every row is rewritten.

The pg_repack alternative

The same physical reordering can be done online with pg_repack, which builds a shadow table populated in index order, captures concurrent changes through triggers, applies the captured changes, and swaps the tables atomically. The result is the same physical layout as CLUSTER without the downtime.

The cost is operational complexity. pg_repack is an extension that must be installed. The shadow table doubles disk usage temporarily. The trigger-based change capture has its own failure modes if the table is being updated faster than the repack can keep up. For most production deployments where downtime is unacceptable, this is the right answer despite the complexity.

The decision point is whether your table is hot enough that a maintenance window will not work. If the table can be CLUSTERed during a quiet hour, the simple builtin tool is the right choice. If the table is hot 24/7, pg_repack earns its complexity. There is no third option that gives you both online operation and built-in simplicity.

When CLUSTER actually helps

The four workload shapes that benefit most. First, range scans on a correlated column that returns many rows. The canonical example is a time-series table queried by date range. After CLUSTER on the timestamp index, the range scan reads a contiguous block of pages instead of jumping across the table.

Second, BRIN indexes on a column where physical order has drifted. BRIN stores one min-max entry per range of physical pages, and its effectiveness depends entirely on the column being correlated with physical order. A CLUSTER restores correlation and makes the BRIN index useful again.

Third, multi-column index scans where the leading column has many distinct values. After CLUSTER on the multi-column index, scans that filter on the leading columns and then sort on the trailing columns can avoid the sort step entirely because the rows are already in the right order.

Fourth, partitioned tables where each partition can be CLUSTERed independently during off-peak hours. The lock is per-partition, so the impact is bounded. This is the cleanest pattern for time-series tables: partition by month, CLUSTER each completed month exactly once, never need to maintain it again.

When CLUSTER does not help

The cases where the operation looks attractive but does not pay back. Random-access queries that fetch single rows by primary key do not benefit from CLUSTER: the index already finds the page in one lookup, and physical ordering of other rows does not matter.

OLTP workloads with high update rates degrade the clustering rapidly. If 10 percent of the table changes every day, the CLUSTER from last week is essentially gone. The repeated rewriting cost exceeds the benefit.

Tables where the indexed column has low cardinality (a boolean status, an enum with five values) cannot be meaningfully clustered. The physical ordering by such a column does not produce useful range-scan locality.

Tables where multiple range-scan patterns compete (one query wants ordering by created_at, another by customer_id) can only be optimized for one. CLUSTER on one column makes scans on the other column slower because the new clustering scatters the previously-clustered column.

The procedural pattern

The discipline for running CLUSTER in production. Before starting, check pg_stats.correlation for the column you plan to cluster on. If correlation is already above 0.9, the CLUSTER will not produce much benefit. Below 0.5 is where the operation reliably pays back.

Check disk space: ensure at least 2x the table size in free space. Estimate WAL volume and check replication lag tolerance if you have replicas. Set statement_timeout = 0 for the session to prevent the operation from being interrupted.

Run CLUSTER VERBOSE table_name USING index_name; The VERBOSE flag prints progress, which is useful because there is no pg_stat_progress_cluster view. The operation cannot be interrupted cleanly: if you kill the query, you may end up with a partial state that requires manual cleanup.

After completion, run ANALYZE on the table to refresh statistics, because the planner uses correlation as a cost input and the value has just changed dramatically. The new query plans may differ from the pre-CLUSTER plans, which is often the point.

The mental model

Storage layout is one of the few performance levers that operates below the query plan. Most performance work is at the query and index level. CLUSTER changes the physical substrate the queries operate against. This is one of the unusual cases where the cost is paid up front and the benefit is realized later, repeatedly, on every range query for as long as the clustering holds.

For database designs where the dominant access pattern is range scans on a known column, CLUSTER (or its online equivalents) is one of the highest-leverage optimizations available. For designs without such a dominant pattern, it is operational overhead with no payback.

What we do across the four products

Across DocuMint, CronPing, FlagBit, and WebhookVault, we run on SQLite, which has a different storage model: tables are B-tree organized by primary key, so rows are always in primary-key order. The CLUSTER concept does not apply directly because the storage layout is already determined by the rowid or primary key.

The Postgres-specific patterns matter when a product graduates from SQLite. The webhook capture table in WebhookVault is the most likely first candidate: it is append-only and queried mostly by time range, which is the exact pattern CLUSTER on a timestamp index handles well. The plan if we move to Postgres is to partition the table by month and accept the per-partition clustering cost as part of partition rotation.

The deeper observation is that the schoolroom version of database performance focuses on indexes and query plans, but production performance often depends on physical layout decisions that are invisible to the application. CLUSTER is one of those decisions: a tool that costs nothing in the application code and can change query performance by an order of magnitude for the right workload.

Postgres CLUSTER and Physical Row Order: When Storage Layout Affects Query Performance

Anethoth

What CLUSTER actually does

The pg_repack alternative

When CLUSTER actually helps

When CLUSTER does not help

The procedural pattern

The mental model

What we do across the four products

Read more

How Treehoppers Communicate Through Plant Stems: The Strange Substrate-Borne Vibrational Network

The Forgotten History of the Microwave Oven: How Radar Engineering Reshaped the Kitchen

Postgres pg_settings: Reading and Reasoning About Configuration at Runtime

Designing API Webhook Payloads: Snapshots vs References and the Right Default for B2B SaaS