Backfilling Production Tables: How to Add Data Without Locking the Database
Adding a column with a default value to a 50-million-row table the naive way will lock production for the duration of the rewrite. The patterns that work avoid the rewrite, do the backfill in chunks outside transactions, and treat the migration as several deploys instead of one.
The shape of a database backfill incident is familiar. The migration looks innocent in the pull request: add a column, default it to zero, the unit tests pass. In staging the migration runs in two seconds against a table with a few thousand rows. In production the same migration locks the table for fifteen minutes against fifty million rows, every read and write blocks behind it, and the engineer who ran it learns several things about production traffic the hard way.
Backfilling is one of those operations where the failure mode is invisible in any environment smaller than production. The technique that works at thousands of rows does not work at millions, and the technique that works at millions does not work at hundreds of millions. We have run a fair number of backfills across DocuMint, CronPing, FlagBit, and WebhookVault at different scales, and the patterns that hold up consistently are about avoiding the rewrite, breaking the work into chunks, and treating a single logical schema change as several physical deploys.
Why ALTER TABLE ADD COLUMN with a default rewrites the whole table
In PostgreSQL before version 11, ALTER TABLE ADD COLUMN ... DEFAULT 'something' required rewriting every row in the table to physically include the new column value. This took an ACCESS EXCLUSIVE lock on the table for the duration, which blocks every other operation including plain SELECT. On a multi-gigabyte table this is many minutes of downtime.
PostgreSQL 11 introduced an optimization for constant defaults: the new column is added to the metadata, the default value is stored in pg_attribute, and existing rows are not rewritten. The default is filled in virtually when the row is read. This is a substantial improvement and covers the common case of adding a column with a simple default. The lock is still ACCESS EXCLUSIVE but is now brief enough to be a non-event.
The optimization does not apply if the default is non-constant (a function call like now(), a subquery, or any expression that varies per row). It also does not apply if you add a NOT NULL constraint without a default in the same statement, because Postgres needs to verify the constraint against every row. The full table rewrite still happens in these cases, and the workaround is to break the operation into stages: add a nullable column without a default, backfill it in chunks, then add the NOT NULL constraint in a separate operation.
The two-phase NOT NULL pattern
The canonical pattern for adding a NOT NULL column to a busy table is four steps over several deploys. First, add the nullable column with no default. This is a metadata-only change in modern Postgres and acquires ACCESS EXCLUSIVE only briefly. Second, update application code to write the new column on every insert and update. Third, backfill the existing rows in chunks. Fourth, add the NOT NULL constraint using NOT VALID then VALIDATE CONSTRAINT in two separate steps.
The NOT VALID trick is one of the most useful tools in the operational Postgres toolkit. ALTER TABLE ... ADD CONSTRAINT ... NOT VALID adds a constraint that applies to new rows but does not verify it against existing rows. This is a fast metadata-only operation. Then ALTER TABLE ... VALIDATE CONSTRAINT ... verifies the constraint against existing rows, but does so with only a SHARE UPDATE EXCLUSIVE lock, which allows concurrent reads and writes. The whole operation, including the validation scan, can run against active production traffic.
Chunked backfills
The backfill itself — the step that fills in values for existing rows — is the part most teams get wrong. The naive approach is a single UPDATE statement that hits every row needing the backfill. On a fifty-million-row table this takes minutes, generates an enormous amount of WAL, locks every updated row for the duration of the transaction, and blocks autovacuum from cleaning up. It is the worst possible shape of update.
The right shape is chunked: process N rows at a time in their own transaction, commit, briefly pause, repeat. The chunk size depends on your write throughput; we typically use 1,000 to 10,000 rows per chunk, with the upper end for tables that have few indexes and the lower end for tables that are heavily indexed (because every index needs to be updated for every row).
The query shape for chunked update is a CTE that selects the next chunk by primary key and updates only those rows:
WITH batch AS (
SELECT id FROM big_table
WHERE new_column IS NULL
ORDER BY id
LIMIT 5000
)
UPDATE big_table SET new_column = compute_new_value(...)
WHERE id IN (SELECT id FROM batch)
RETURNING id;The returned IDs let the script log progress and resume from where it left off if interrupted. The ORDER BY id ensures forward progress; without it, the LIMIT could repeatedly select the same rows.
Between chunks, a small sleep (a few hundred milliseconds) gives autovacuum room to clean up dead tuples produced by the updates. Without the pause, dead tuples accumulate faster than autovacuum can clean them, the table bloats, and the backfill gets slower as it progresses. We have seen unpaused backfills double the on-disk size of a table by the time they finish.
The dual-write window
While the backfill is running, new rows being inserted by the application must also get the new column value. This is what step two above accomplishes: deploy application code that writes the new column on every insert and update before starting the backfill. Now the backfill is just catching up the historical rows; new rows are already correct.
The discipline that this enforces is that the schema change cannot be a single deploy. You need at least three: deploy the migration that adds the nullable column, deploy the application code that writes it on new rows, then run the backfill (which may be a separate one-off task rather than a code deploy). After the backfill is verified complete, a fourth deploy can add the NOT NULL constraint and remove any application code that was written to handle null values during the transition.
This is a substantial increase in process overhead compared to a single migration. It is also the reason that production backfills do not cause outages. The same logic applies to any schema change that affects existing data: column renames, type changes, constraint additions, default changes. The two-phase pattern with explicit dual-write windows is the operational price of zero-downtime evolution.
Idempotency and resumability
Long-running backfills get interrupted. The connection drops, the script crashes, the operator hits Ctrl-C, the database has a brief failover. The backfill script needs to be resumable, which means it needs to be idempotent: running it twice should produce the same result as running it once.
The natural way to achieve this is to predicate the update on the column still being null (or whatever sentinel indicates the row has not been backfilled yet). The WHERE new_column IS NULL clause in the query above is what makes the backfill idempotent: a row that has already been backfilled is not selected on the next iteration.
Logging is also essential. The script should log the last processed primary key (or the chunk number, or both) after every chunk, so that an interruption can be diagnosed and the script can be restarted from a known point. The log should be persistent (a small table in the database, not just stdout), so that operators investigating the next morning can see exactly how far the backfill got and which rows it touched.
The verification step
The last step is verifying that the backfill is actually complete before moving to the VALIDATE CONSTRAINT step. The verification is a simple count: SELECT count(*) FROM big_table WHERE new_column IS NULL should return zero. If it does not, the backfill is not done, and adding the NOT NULL constraint will fail with a confusing error message blaming a specific row that happens to have been inserted between the backfill end and the constraint addition.
The verification should also include a sanity check on the values: pick a sample of rows and verify that the new column has the expected relationship to the other columns. This catches the case where the backfill script had a bug and produced systematically wrong values. The verification is cheap and the cost of a silently corrupted backfill is high; the cost-benefit always favors verifying.
The deeper observation
A backfill is one of the operations where the difference between a small system and a large system is qualitative rather than quantitative. The technique that takes five lines of SQL at one scale takes a hundred lines of carefully-orchestrated script at another. The reason is not that the database has changed — Postgres works the same way at every scale — but that the relationship between the operation and the production traffic has changed. At small scale the operation is faster than any plausible incident response; at large scale the operation is the incident response.
The patterns that hold up are mostly about respecting that asymmetry: avoid full-table rewrites, work in chunks, give the database room to clean up, design the schema change as a sequence of deploys rather than one, and verify the result. None of these patterns are clever. They are the operational discipline that turns a potentially catastrophic schema change into a non-event, and the teams that have them have them because they have learned the lesson the hard way at least once.