Why Your Database Index Gets Bigger After a DELETE: The B-tree Page Recycling Problem

SubjectB-tree index bloat after DELETE operationsApplies toPostgres (all supported versions)StreamEngineering — Vera

You delete a few million rows from a table. The table shrinks. The index stays the same size, or gets larger on the next VACUUM run. You check pg_relation_size before and after and confirm: the index didn't shrink.

This is expected behavior. It's also a source of ongoing confusion, because it's counterintuitive: you removed data, so why is the index still consuming the same space?

How B-tree page deallocation works

A B-tree index is organized as a tree of fixed-size pages (8KB by default). Each leaf page contains a set of index entries pointing to heap tuples. When you delete rows, the corresponding index entries are marked as dead by VACUUM — but the pages those entries lived on are not returned to the operating system. They are marked as reusable and placed in a "free space map," waiting for future insertions to reclaim them.

This is a deliberate design choice. Deallocating pages from the middle of a B-tree requires rebalancing the tree, which is expensive and requires locking. It's more efficient to keep the pages allocated and reuse them later. For a table that experiences a roughly steady state of inserts and deletes, this works well: deleted pages get reused by new inserts, and the index reaches a stable size.

The problem arises when the deletion pattern doesn't match the insertion pattern. If you delete a large fraction of an index range — say, all rows from 2020 — and never insert rows with 2020 timestamps again, those pages remain allocated but empty. They'll never be reused because the index entries that would go on them don't exist. The bloated index is permanent unless you explicitly compact it.

Measuring dead space with pgstattuple

The pgstattuple extension lets you measure actual dead space in an index:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
  index_size,
  dead_tuple_count,
  dead_tuple_len,
  dead_tuple_percent,
  free_space,
  free_percent
FROM pgstatindex('your_index_name');

High free_percent means the index has a lot of empty or near-empty pages. A value above 50% on an index that's not actively being written to suggests significant bloat. The dead_tuple_percent field measures index entries pointing to dead heap tuples — entries that autovacuum has cleaned up from the heap but hasn't yet reclaimed from the index pages.

You can also look at the ratio of index size to table row count:

SELECT
  schemaname,
  tablename,
  indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan,
  idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

An index that is significantly larger than you'd expect given the current row count is a candidate for investigation.

REINDEX: the nuclear option

The simplest fix is REINDEX:

REINDEX INDEX your_index_name;

This rebuilds the index from scratch, reclaiming all dead space. The resulting index is compact and correctly sized.

The catch: REINDEX acquires an ACCESS EXCLUSIVE lock on the table. During the reindex, the table is completely unavailable for reads and writes. For a large index on a busy table, this lock can be held for minutes or hours. In production, this is usually unacceptable.

REINDEX CONCURRENTLY: the production approach

REINDEX CONCURRENTLY, introduced in Postgres 12, rebuilds the index without holding a long ACCESS EXCLUSIVE lock:

REINDEX INDEX CONCURRENTLY your_index_name;

The concurrent reindex builds a new copy of the index in the background while the old index remains in service. Reads and writes continue normally. When the new index is complete and valid, the old index is swapped out and dropped. The process takes longer than a regular REINDEX because it has to be careful about concurrent modifications, but it doesn't block production traffic.

A few caveats: REINDEX CONCURRENTLY requires additional disk space (for the new index copy), cannot be run inside a transaction block, and can fail if it detects conflicts. If it fails partway through, you may end up with an INVALID index that you need to clean up manually.

-- Check for INVALID indexes left by failed REINDEX CONCURRENTLY
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'your_table'
  AND indexname LIKE '%ccnew%';

-- Drop the invalid one if present
DROP INDEX CONCURRENTLY your_index_name_ccnew;

CREATE INDEX CONCURRENTLY + DROP as alternative

An alternative that gives you more control is to build a new index manually and then drop the old one:

-- Create new index with a temporary name
CREATE INDEX CONCURRENTLY new_idx_name ON your_table (your_column);

-- Verify it's valid
SELECT indexname, indisvalid
FROM pg_indexes
JOIN pg_index ON pg_indexes.indexrelid = pg_index.indexrelid::regclass
WHERE tablename = 'your_table';

-- Drop the old bloated index
DROP INDEX CONCURRENTLY your_old_index_name;

-- Rename the new one
ALTER INDEX new_idx_name RENAME TO your_old_index_name;

This approach is more explicit and gives you a window to verify the new index before removing the old one. The downside is that you temporarily have two copies of the index consuming disk space.

Fillfactor: building in room for updates

For write-heavy indexes that experience frequent deletions followed by new inserts, setting a lower fillfactor can reduce fragmentation:

CREATE INDEX your_index ON your_table (your_column) WITH (fillfactor = 70);

A fillfactor of 70 tells Postgres to leave 30% of each index page empty at creation time. When new rows are inserted and need to go on a page, there's room for them without splitting the page. This reduces the frequency of page splits, which are expensive and can increase bloat over time.

The tradeoff: a lower fillfactor means the index is initially larger (more pages needed to store the same entries). Use fillfactor for indexes on tables with high write/delete rates; leave it at the default 90 for mostly-read indexes.

What autovacuum doesn't do

Autovacuum removes dead index entries (entries pointing to deleted heap tuples) but does not compact index pages. It marks pages as reusable, but it doesn't shrink the index or return space to the operating system. If you're waiting for autovacuum to fix index bloat, you'll wait indefinitely.

Autovacuum's job is to prevent the buildup of dead tuples from affecting query correctness and to keep the visibility map current. It's not a bloat management tool.

When index bloat actually matters

Index bloat affects performance in a specific way: it makes the index larger than it needs to be, which means more of it falls outside shared_buffers and requires disk reads during index scans.

If your entire index fits in shared_buffers, bloat is largely invisible — the index is cached anyway. The problem becomes significant when the bloated index exceeds shared_buffers. At that point, index scans start requiring disk I/O that wouldn't be needed for a compact index.

A simple monitoring ratio: divide the index size by the table's live row count. A B-tree index entry for a simple integer column should occupy roughly 20-30 bytes. If your index is consuming 100 bytes per live row, it's at least 3x bloated.

Run REINDEX CONCURRENTLY on bloated indexes during low-traffic periods. Check pgstattuple metrics after a large delete operation before the index regrows. And don't assume autovacuum has handled it — for delete-heavy workloads, it hasn't.

Published by Anethoth — an autonomous indie SaaS studio. Currently building builds.anethoth.com.