Postgres VACUUM and MVCC: How Database Garbage Collection Actually Works
PostgreSQL does not update rows in place. Every UPDATE writes a new row version and leaves the old one behind, and VACUUM is the background process that cleans up. Most teams discover this when autovacuum falls behind and the database starts getting slower without an obvious cause.
The PostgreSQL data model is one of the more surprising things to learn about a database that most engineers think they understand. UPDATEs do not modify rows in place. Instead, every UPDATE writes a new copy of the row with a new transaction ID and leaves the old copy behind as a dead tuple. Eventually a background process called autovacuum sweeps through the table, finds tuples that no live transaction can see, and reclaims their space. The mechanism is called multi-version concurrency control (MVCC), and understanding how it interacts with VACUUM is one of the higher-leverage pieces of operational knowledge for anyone running PostgreSQL in production.
This post covers what MVCC actually does, why dead tuples accumulate, what VACUUM is responsible for, and the failure modes that show up when autovacuum falls behind. The patterns apply across the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — though we currently use SQLite where these specific concerns do not apply. The migration to PostgreSQL is a planned future step, and the operational model is worth understanding before that migration earns its weight.
What MVCC actually does
When a transaction reads a row in PostgreSQL, it does not see the latest value. It sees the version of the row that was visible at the moment the transaction started. This is the central guarantee that makes long-running reports possible without blocking writes: a report that takes ten minutes to run sees a consistent snapshot of the database as it existed when the report started, regardless of what happens to the underlying tables during those ten minutes. The same mechanism makes the difference between read committed and repeatable read isolation levels: in read committed, each statement gets a fresh snapshot; in repeatable read, the snapshot is taken at the start of the transaction and held for its lifetime.
The implementation is straightforward in concept. Every row has hidden columns called xmin (the transaction ID that created the row) and xmax (the transaction ID that deleted or replaced the row, or zero if the row is current). Every transaction has its own transaction ID and a list of which other transactions are still in progress. When a query reads a row, the engine checks whether the row's xmin is committed and visible from this transaction's perspective, and whether xmax is uncommitted or invisible. If both checks pass, the row is part of this transaction's view; otherwise it is skipped.
The consequence is that an UPDATE does not modify the existing row. It writes a new row with the same primary key but a different physical location, sets xmax on the old row to the current transaction ID, and sets xmin on the new row to the same transaction ID. Both rows now exist on disk. Transactions that started before the UPDATE committed see the old row; transactions that start after see the new row. The old row sticks around until no transaction could possibly need it.
Why dead tuples accumulate
A dead tuple is a row version that no live transaction can see. The row was deleted or replaced by an UPDATE, all transactions that could have seen the old version have committed or aborted, and the space the row occupies is no longer reachable from any query. PostgreSQL does not eagerly reclaim this space because doing so would require scanning the active transaction list on every UPDATE, which would serialize writes and destroy concurrency. Instead the space stays as a dead tuple until a background process visits the table.
The accumulation rate depends entirely on the workload. A table that is mostly read with occasional inserts has almost no dead tuples. A table that is updated frequently — a sessions table where last_activity gets updated on every request, an idempotency-key table where a status column gets updated as the operation progresses, a counter table where a count column increments — accumulates dead tuples in proportion to its update rate. A workload that updates every row in a table once per hour will accumulate dead tuples equal to the table size every hour.
The visible effect of accumulated dead tuples is bloat. The table on disk is much larger than the live data it contains, sequential scans take longer because they have to walk past the dead tuples, indexes get larger because they have to point to all the row versions, and cache hit rate drops because the working set no longer fits in RAM. A table that should be one gigabyte can grow to ten gigabytes if updates are frequent and VACUUM falls behind.
What VACUUM is responsible for
VACUUM has three jobs. First, it scans tables for dead tuples, marks the space they occupy as reusable, and updates the visibility map so future scans can skip pages that are entirely visible. Second, it updates statistics that the query planner uses to choose execution strategies. Third, and most operationally critical, it advances the relation's frozen-transaction-ID horizon. PostgreSQL's transaction IDs are 32-bit integers, which would wrap around every four billion transactions. To prevent this from causing data loss, VACUUM rewrites very old rows with a special "frozen" marker that means the row is visible to all current and future transactions regardless of XID comparison. A relation that has not been VACUUM-ed for a long time accumulates rows that are at risk of falling off the back of the XID range.
Standard VACUUM does not return space to the operating system. It marks pages as reusable so future inserts and updates can fill them, but the file size on disk does not shrink. VACUUM FULL rewrites the table from scratch, copying live tuples to a new file and dropping the old one, which does shrink the on-disk size. VACUUM FULL takes an exclusive lock on the table and is generally not the right answer for production tables; the right answer is to keep autovacuum running aggressively enough that the table never accumulates enough bloat to need a rewrite.
How autovacuum decides what to vacuum
The autovacuum daemon wakes up periodically (every minute by default) and decides which tables need to be vacuumed. The decision is based on two thresholds: a per-table tuple threshold (autovacuum_vacuum_threshold, default 50) and a per-table fraction (autovacuum_vacuum_scale_factor, default 0.2). A table is eligible for vacuum when the number of estimated dead tuples exceeds threshold + scale_factor * table_size. For a 1000-row table, this is 50 + 0.2 * 1000 = 250 dead tuples. For a 10-million-row table, it is 50 + 0.2 * 10000000 = 2 million dead tuples.
The default scale factor of 0.2 means a large table accumulates a lot of dead tuples before autovacuum runs. For tables that update frequently, the right configuration is usually a much smaller scale factor — 0.01 or 0.05 — so autovacuum runs more often and never falls far behind. The setting can be applied per-table via ALTER TABLE, which is the right pattern for the few tables that are write-heavy without applying the more aggressive setting globally.
The failure modes
The first failure mode is autovacuum being throttled too much. The default cost-based delay (autovacuum_vacuum_cost_delay) limits how much I/O autovacuum can do per cycle, which on modern hardware is far more conservative than necessary. Tuning this down — or to zero on hardware that can absorb the I/O — lets autovacuum keep up with workloads that overwhelm the default settings.
The second failure mode is long-running transactions. Every transaction holds a snapshot, and that snapshot prevents VACUUM from reclaiming any tuple that was visible when the transaction started. A connection that opens a transaction and then sits idle for hours — a worker that hangs on a network call, an analyst who started a query in psql and went to lunch — prevents VACUUM from cleaning up across the entire database. The pg_stat_activity view shows backend_xmin and backend_xid for active connections; the oldest backend_xmin is the wall that VACUUM cannot get past. The standard mitigation is the idle_in_transaction_session_timeout setting, which kills connections that hold a transaction open without doing work.
The third failure mode is the wraparound emergency. If VACUUM falls so far behind that a relation's transaction IDs are about to wrap, PostgreSQL enters a state where it refuses new transactions on that relation until VACUUM catches up. This is not a graceful degradation — it is a hard stop, and recovering requires running VACUUM with the right options under load. The pg_database view shows datfrozenxid, which is the oldest XID that has been frozen across the database; alerting on this metric well before it approaches the wraparound point is a basic operational discipline.
The five operational signals
The signals worth monitoring are: pg_stat_user_tables.n_dead_tup divided by n_live_tup as a per-table dead-tuple ratio (alert when above 20% on important tables); pg_stat_user_tables.last_autovacuum as the most recent autovacuum time (alert when older than expected for write-heavy tables); pg_stat_activity for the oldest backend_xmin (alert on idle-in-transaction connections older than a few minutes); the table file size on disk vs the size estimated from row count and average row size (alert on large bloat ratios); and pg_database.datfrozenxid age relative to the autovacuum_freeze_max_age setting (alert when above 75%).
The deeper observation
The MVCC model trades cleanup-as-you-go for cleanup-in-the-background, and the trade-off is usually worth it: writes do not block on cleanup, readers get consistent snapshots, and the average case has good performance. The cost is that the database has a background maintenance process that needs to keep up with the workload, and when it falls behind, the symptoms show up as slow queries and growing disk usage rather than as obvious errors. Most teams discover MVCC by debugging a slow database and finding that a frequently-updated table is ten times larger on disk than its live data. Knowing the model in advance and configuring autovacuum aggressively for write-heavy tables is one of the higher-leverage things a backend team can do before the symptoms start.