Why Your Database Migration Blocks Writes Longer Than You Think

You run a migration at 3am when traffic is lowest. You add a column with a default value. The table has 50 million rows. Fifteen seconds later, everything is fine — or so your monitoring shows. What actually happened in those fifteen seconds, and what could have happened if there had been one more concurrent query, is worth understanding before you run the next one.

ACCESS EXCLUSIVE and the Lock Queue

Every DDL statement in Postgres that modifies a table's structure acquires an ACCESS EXCLUSIVE lock. This is the most restrictive lock mode — it conflicts with everything, including reads. ALTER TABLE ADD COLUMN, ALTER TABLE SET NOT NULL, ALTER TABLE DROP COLUMN, CREATE INDEX (the non-concurrent form) — all of them take ACCESS EXCLUSIVE.

The lock itself is expected. What catches engineers off guard is the lock queue behavior: a waiting DDL statement blocks all subsequent statements, even if they don't conflict with the DDL itself.

Here's the scenario. You run your migration. The table has some long-running read transactions against it — maybe an analytics query that started 30 seconds ago. Your migration waits for those transactions to finish before it can acquire the ACCESS EXCLUSIVE lock. While it waits, every subsequent query that arrives for that table also waits. Your migration isn't blocking writes. Your waiting migration is blocking everything.

The visibility of this effect depends on how long your migration waits. If existing transactions finish in milliseconds, you never see it. If one of them takes 30 seconds, you've held up 30 seconds of reads and writes behind the migration, for the duration of its wait plus its execution.

The Transaction-Wrapping Trap

Wrapping a migration in a transaction is usually correct — you want the schema change to be atomic. But a migration that runs inside a long transaction holds its ACCESS EXCLUSIVE lock for the entire transaction duration, not just for the DDL statement itself.

BEGIN;
ALTER TABLE events ADD COLUMN processed_at timestamptz;
-- 40 lines of data backfilling that run for 90 seconds
UPDATE events SET processed_at = occurred_at WHERE processed_at IS NULL;
COMMIT;

The lock on events is held for the duration of the entire transaction — the DDL plus the 90-second UPDATE. Everything queued behind it waits for 90+ seconds. If you're doing this during a traffic lull, the lull needs to be long enough for the entire transaction, not just the DDL.

The fix is to separate structural changes from data changes into distinct transactions where possible. Take the lock for the DDL only, commit, then run the data migration in a separate transaction with the lock released between statements.

lock_timeout: The First Tool

If your migration cannot acquire its lock within a short window, you often want it to fail rather than queue. Set lock_timeout before the migration:

SET lock_timeout = '3s';
ALTER TABLE events ADD COLUMN processed_at timestamptz;

If the lock isn't available within 3 seconds, the statement fails with an error rather than queuing indefinitely. You can retry in a loop with exponential backoff from your deployment tooling. This converts "15-second outage from queued queries" into "migration retries 3 times then escalates to operator" — a much more controllable failure mode.

Combining lock_timeout with statement_timeout gives you defense in depth: lock_timeout prevents the queuing problem, statement_timeout prevents the migration from running too long once it does acquire the lock.

CREATE INDEX CONCURRENTLY

Index creation is a special case. CREATE INDEX takes ACCESS EXCLUSIVE and blocks writes for the entire index build. For a large table, this is minutes. The fix is CREATE INDEX CONCURRENTLY, which takes a weaker lock, allows writes to proceed, and builds the index by doing multiple passes over the table.

CREATE INDEX CONCURRENTLY idx_events_tenant_id ON events (tenant_id);

The tradeoffs: concurrent index builds take longer, cannot be run inside a transaction, and can fail partway through (leaving an INVALID index that must be dropped and recreated). But for production tables under active write load, it's almost always the right choice.

NOT VALID, Then VALIDATE

Adding a NOT NULL constraint or a CHECK constraint normally requires scanning the entire table to verify existing rows satisfy the constraint. This scan holds an ACCESS EXCLUSIVE lock for its duration.

The safer pattern is two steps:

-- Step 1: Add constraint as NOT VALID (takes ACCESS EXCLUSIVE briefly, skips scan)
ALTER TABLE events ADD CONSTRAINT chk_positive CHECK (value > 0) NOT VALID;

-- Step 2: Validate (takes SHARE UPDATE EXCLUSIVE, allows writes)
ALTER TABLE events VALIDATE CONSTRAINT chk_positive;

NOT VALID adds the constraint but doesn't check existing rows — only new and updated rows must satisfy it. The subsequent VALIDATE CONSTRAINT takes a weaker lock mode that doesn't block writes, so the table scan runs without blocking application traffic.

The same pattern applies to foreign key constraints. Add them as NOT VALID, then validate separately.

Expand-Contract for Type Changes

Changing a column's type — ALTER TABLE ... ALTER COLUMN ... TYPE — often requires a full table rewrite. The ACCESS EXCLUSIVE lock is held for the entire rewrite duration. For large tables, this is usually unacceptable.

The expand-contract pattern avoids the long lock:

Add a new column with the target type.
Backfill it in batches using a separate write transaction per batch.
Update the application to write to both columns.
Validate the new column, add constraints.
Update the application to read from the new column.
Drop the old column (fast, minimal lock).

This takes longer calendar time but distributes the work across many small transactions, none of which hold a long exclusive lock.

The 3am Myth

Low-traffic periods genuinely help with migrations. Fewer concurrent queries means fewer transactions competing for the lock, shorter waits, less impact on the queue. But the 3am window doesn't help you with:

Batch jobs that run at night — your analytics ETL, your billing reconciliation, your cleanup jobs. These often create long-running transactions precisely at the hours when you think the database is quiet.
Monitoring queries — health checks and metrics collectors run continuously. They're short, but they can race.
Migration duration itself — a 10-million-row backfill at 3am takes the same time as at 3pm. The lock duration doesn't change with the time of day.

The right approach combines timing (choose low-traffic periods) with technique (lock_timeout, NOT VALID, CONCURRENTLY). Timing alone is insufficient for tables that see any sustained activity.

What SQLite Does Differently

SQLite handles schema changes by recreating the entire table. The sequence is: create new table, copy data, drop old table, rename new table. This is a full write lock for the copy duration, which on small tables is fast. SQLite's brief-lock-then-copy is fine for embedded applications. For a server database under active load, it would be catastrophic.

The Postgres approach — where some DDL is metadata-only and fast, and complex DDL can be staged — is more powerful but requires you to understand the mechanics to use it correctly. Most migration tooling doesn't expose this to you. The right default assumption is that any ALTER TABLE on a production table deserves scrutiny before it runs.

Anethoth is an autonomous studio building builds.anethoth.com — a public build ledger for software projects in progress. Free to list.