Two-Phase Schema Migrations: Evolving Production Tables Without Downtime

The first time you run a schema migration on a small production table, it works. ALTER TABLE users ADD COLUMN preferences jsonb DEFAULT '{}'; finishes in 100 ms, the column appears, the application picks it up, everyone goes home. The second time, the table has 50 million rows and the same statement holds an exclusive lock long enough to take down the service. The third time you have learned to be afraid of migrations, and you start writing them in a way that respects the size of the data they are operating on.

The pattern that makes this safe is the two-phase migration: expand the schema in a way that is backward-compatible with the current code, deploy the new code, and then contract the schema by removing the old shape. The expand phase and the contract phase are separated by at least one deployment, often longer. Between them, both the old and new shapes coexist. This is more work than a single ALTER, and it is the only way to do it safely on a table that is too large or too hot to lock.

The shapes of breaking changes

Every breaking schema change can be categorized by what it breaks:

Column rename: code that reads the old name breaks immediately.
Column type change: code that wrote the old type may produce values the new type rejects.
Column drop: code that wrote to the column gets an error.
Column NOT NULL addition: writes without the column fail.
Foreign key addition: writes that violate the constraint fail; backfill of existing rows that violate is required first.
Unique constraint addition: writes that violate fail; existing duplicates require resolution first.
Table rename or drop: code that references the old name breaks immediately.

Each of these has a two-phase pattern. The pattern is always: introduce the new shape alongside the old, write to both, deploy, switch reads, deploy, then remove the old.

Renaming a column: the canonical case

Suppose you have a column user_email that you want to rename to email_address. The single-phase approach — ALTER TABLE users RENAME COLUMN user_email TO email_address — is atomic in PostgreSQL, but it requires every reader and writer of the column to update simultaneously. You cannot deploy code in a rolling fashion if the schema changes synchronously. The two-phase approach:

Phase 1, expand: Add email_address as a new column. Backfill it from user_email for all existing rows. Deploy code that writes to both columns and reads from user_email (the old, source of truth). The schema now has both columns; the application writes both; reads still come from the old column. This is the safe interim state.

Phase 2, switch reads: Deploy code that reads from email_address (the new column) but continues to write to both. The new column is now the source of truth. The old column is being maintained for safety. This is reversible: you can roll back to phase 1 without data loss.

Phase 3, contract: Stop writing to user_email. Drop the column. The migration is complete.

Each phase is its own deployment, separated by enough time to be confident the previous phase is stable. For a hot table, the gap between phases might be a week. For a cold table, it might be a single deployment cycle. The discipline is the same.

Type changes: the trickiest case

Type changes are the migrations most likely to bite. If you want to widen price_cents INT to price_cents BIGINT, the rewrite cost in PostgreSQL is large but bounded. If you want to change a column from TEXT to UUID, the cost is unbounded because some existing values may not parse as UUIDs.

The pattern: add a new column with the target type. Write to both. Backfill the new column with a transformation that handles the conversion (or fails loudly when it cannot). Validate the backfill: run a query that compares the two columns and emits any rows where they disagree. Fix or document the disagreements. Switch reads to the new column. Drop the old.

The validation step is where teams skip and pay later. Skip it and you find the unconvertible rows in production six months later when one of them is read in a code path you forgot about.

Adding NOT NULL: the backfill order matters

Adding NOT NULL to an existing column requires that every row already have a value for that column. The naive sequence — add the column with a default, then set NOT NULL — has a hidden cost: setting NOT NULL in PostgreSQL requires a full table scan to verify, which holds an ACCESS EXCLUSIVE lock on a busy table.

The modern PostgreSQL pattern: add a CHECK constraint with NOT VALID, then validate it concurrently with ALTER TABLE ... VALIDATE CONSTRAINT, which acquires a much weaker lock. Once validated, the constraint behaves like NOT NULL for new writes, and you have proven that all existing rows comply without ever holding an ACCESS EXCLUSIVE lock. The actual SET NOT NULL can then run almost instantly because the planner knows the validation has been done.

This is the pattern almost every PostgreSQL migration tool now uses by default. If you are writing migrations by hand, look up the exact sequence for your version.

Backfilling without locking

Backfills are often the longest part of a migration. The single statement UPDATE big_table SET new_col = derive(old_col) works on a 100,000-row table and takes down a 100-million-row table because it holds row locks for too long.

The pattern: chunked backfill in a script that runs outside the migration. Process 1000 to 10,000 rows at a time, with a transaction per chunk. Sleep briefly between chunks to allow vacuum and other workloads to run. Track progress in a small metadata table or in a file so you can resume if the backfill is interrupted. The backfill takes longer in wall-clock time and is much kinder to the rest of the workload.

For SQLite, the same pattern applies: do not run a full-table UPDATE inside a single transaction on a large database. Chunk it, commit between chunks, and let WAL checkpointing happen between chunks rather than waiting until the end.

Testing migrations against realistic data

The hardest migrations to test are the ones whose pain only appears at scale. A migration that works on a 1000-row dev database may take six hours and lock a critical table on a 100-million-row production database. The fix is to test against a representative copy: a recent production snapshot, restored to a non-production environment, with the migration run end-to-end.

This is not optional infrastructure. If you do not have a way to restore production-shaped data into a test environment, you do not have a way to safely run migrations on a large production database. The first migration that costs you an outage will pay for the cost of building this many times over.

The rollback question

Two-phase migrations are reversible if you stop after phase 1 or phase 2. They are not reversible after phase 3. The discipline is to wait long enough between phase 2 and phase 3 to be confident that nothing in the system still depends on the old shape — including export jobs, replicas, audit-log readers, and human queries.

The rule we use: phase 3 happens at least one full data retention cycle after phase 2. If the table has 30-day retention on important rows, phase 3 happens 30 days after phase 2. This is conservative, and the conservatism has saved us at least twice from a forgotten consumer of an old column.

Migrations are the highest-stakes operation a small team performs against its own database. The two-phase pattern is the discipline that makes them survivable. Once internalized, it stops feeling like extra work and starts feeling like the obvious way to evolve a production schema.

If you want a working example end-to-end, the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — use a documented expand-contract migration pattern in production, with each migration broken into the explicit phases described above.