Postgres ALTER TABLE Online: Patterns That Don't Block Production
Most ALTER TABLE statements take an ACCESS EXCLUSIVE lock that blocks every read and write until the operation completes. Knowing which forms are safe and which ones rewrite the entire table is the difference between a five-second deploy and a two-hour outage.
The ALTER TABLE statement is the most common cause of unexpected production outages we see in Postgres. The syntax is identical across operations that take a few milliseconds and operations that hold an ACCESS EXCLUSIVE lock for hours. The Postgres documentation lists the lock level for each variant, but reading the entire table of contents at three in the morning during an incident is not the same as knowing the patterns up front. This is a survey of the ALTER TABLE forms that bite, the ones that are safe, and the rewrite-the-statement-or-use-a-different-tool patterns that turn dangerous operations into safe ones.
What ACCESS EXCLUSIVE actually means
Postgres has eight lock modes that form a partial order. The two that matter for ALTER TABLE are ACCESS EXCLUSIVE (blocks every other operation on the table including SELECT) and SHARE UPDATE EXCLUSIVE (allows concurrent SELECT and DML but blocks other ALTER, VACUUM, and ANALYZE operations). The third mode that shows up is ACCESS SHARE, which is what every SELECT holds and which conflicts only with ACCESS EXCLUSIVE.
The practical consequence: if your ALTER takes ACCESS EXCLUSIVE and any session is actively reading the table, your ALTER waits. While it waits, every new transaction that touches the table also waits, behind the ALTER. This is the classic lock queue: a long-running SELECT plus an ACCESS EXCLUSIVE ALTER plus a steady stream of new SELECTs produces a brief outage. The ALTER itself may finish in milliseconds once it acquires the lock, but the wait can be unbounded.
The standard mitigation is SET lock_timeout = '5s' before the ALTER. If the lock cannot be acquired within five seconds, the ALTER fails fast and can be retried, rather than queuing behind a long-running transaction and blocking everything else in the meantime. Production migration scripts should always set this.
The cheap operations
Several ALTER TABLE forms take ACCESS EXCLUSIVE but complete in O(1) without rewriting the table. They are safe in the sense that the lock is held only for the duration of catalog updates, not for any data scan:
ALTER TABLE ... ADD COLUMN ... [DEFAULT constant]since Postgres 11. The metadata-only optimization for constant defaults means new columns appear in the catalog immediately, with the default returned at read time from the catalog rather than written to every row.ALTER TABLE ... DROP COLUMN. The column is marked dropped in the catalog and ignored by subsequent queries. The data remains in the heap until the row is updated or the table is rewritten. This is fast but the disk space is not reclaimed until a VACUUM FULL or a CLUSTER.ALTER TABLE ... ADD CONSTRAINT ... NOT VALIDfollowed byALTER TABLE ... VALIDATE CONSTRAINT. The first step is catalog-only with ACCESS EXCLUSIVE briefly; the second takes SHARE UPDATE EXCLUSIVE and scans the table.ALTER TABLE ... ENABLE/DISABLE TRIGGER. Catalog-only.ALTER TABLE ... SET (autovacuum_*). Catalog-only per-table autovacuum tuning.ALTER TABLE ... RENAME COLUMN. Catalog-only but breaks every cached plan and every application that references the old name.
The operations that rewrite
Several ALTER TABLE forms hold ACCESS EXCLUSIVE for the duration of a full table rewrite. They are not safe for any table large enough that the rewrite takes longer than your acceptable outage window:
ALTER TABLE ... ADD COLUMN ... DEFAULT non_constant_expression. Functions likenow(),uuid_generate_v4(), or anything with random output are not constant and force a rewrite.ALTER TABLE ... ALTER COLUMN ... TYPEwhen the type change requires reformatting the column (e.g., text-to-uuid, int-to-bigint with explicit cast, json-to-jsonb).ALTER TABLE ... ALTER COLUMN ... SET NOT NULLwhen done directly. The Postgres 12+ shortcut uses an existing CHECK constraint to skip the scan, which is the right pattern.ALTER TABLE ... ADD COLUMN ... GENERATED ALWAYS AS ... STORED. The expression must be evaluated and stored for every existing row.ALTER TABLE ... ADD PRIMARY KEYwhen the table does not already have a unique index on the column. The CREATE UNIQUE INDEX CONCURRENTLY plus ALTER ... ADD PRIMARY KEY USING INDEX two-step is the safe pattern.VACUUM FULLandCLUSTER. Not ALTER TABLE but worth mentioning because they share the rewrite-and-lock characteristic and pg_repack is the right alternative for production.
The two-step patterns
Most rewrite operations have a two-step alternative that avoids the rewrite under ACCESS EXCLUSIVE. The patterns are worth memorizing:
Adding a NOT NULL constraint to an existing column:
-- Step 1: add a CHECK constraint validating the column is not null
ALTER TABLE users ADD CONSTRAINT users_email_not_null CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: validate the constraint (SHARE UPDATE EXCLUSIVE, allows reads and writes)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- Step 3 (Postgres 12+): convert to NOT NULL constraint using the validated CHECK
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Step 4: drop the redundant CHECK constraint
ALTER TABLE users DROP CONSTRAINT users_email_not_null;The trick in step 3 is that Postgres 12 added the optimization to skip the full-table scan if a validated CHECK constraint proves the column is not null. Without this optimization, the SET NOT NULL would do its own full-table scan under ACCESS EXCLUSIVE.
Adding a foreign key:
-- Step 1: add the foreign key as NOT VALID (catalog-only, ACCESS EXCLUSIVE briefly)
ALTER TABLE orders ADD CONSTRAINT orders_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
-- Step 2: validate (SHARE UPDATE EXCLUSIVE on orders, ROW SHARE on users)
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;Step 2 still takes a lock but a much weaker one that allows concurrent reads and writes on both tables.
Adding a unique index that will become a primary key:
-- Step 1: build the unique index without blocking
CREATE UNIQUE INDEX CONCURRENTLY users_email_uniq ON users (email);
-- Step 2: promote to primary key (ACCESS EXCLUSIVE briefly, catalog-only)
ALTER TABLE users ADD CONSTRAINT users_pkey PRIMARY KEY USING INDEX users_email_uniq;Changing a column type:
This is the hardest case and often requires the expand-write-both-switch-reads-contract pattern. The simplified version:
-- Step 1: add new column
ALTER TABLE events ADD COLUMN created_at_new TIMESTAMPTZ;
-- Step 2: backfill in chunks (application code or a script)
-- Step 3: dual-write from application code
-- Step 4: switch reads to new column
-- Step 5: drop old columnStep 2 must be chunked. A single UPDATE on a large table holds locks and bloats the heap. The pattern is to UPDATE 1000-10000 rows per transaction with the WHERE clause filtering by primary key range and the new column being null, with the worker tracking progress in a separate table.
What CREATE INDEX CONCURRENTLY costs and gives back
The CONCURRENTLY variant of CREATE INDEX is the workhorse of online schema changes. It takes SHARE UPDATE EXCLUSIVE rather than ACCESS EXCLUSIVE, allowing concurrent reads and writes. The cost: the build is slower (typically 2-3x), requires two scans of the table, and cannot run inside a transaction. If the build fails (often due to a unique constraint violation discovered during the second scan), the index is left in an INVALID state and must be dropped with DROP INDEX CONCURRENTLY before retry.
The diagnostic for invalid indexes:
SELECT schemaname, tablename, indexname
FROM pg_indexes
WHERE indexname IN (
SELECT c.relname FROM pg_class c
JOIN pg_index i ON c.oid = i.indexrelid
WHERE NOT i.indisvalid
);This should always return zero rows. If it ever returns rows, your monitoring or your migration script left a mess that the planner will not use and that bloats every write to the table.
The lock_timeout discipline
Every production migration script should set lock_timeout and statement_timeout. The standard pattern:
BEGIN;
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '30s';
ALTER TABLE ... ;
COMMIT;The lock_timeout protects against queueing behind a long-running transaction. The statement_timeout protects against the operation itself running longer than expected. The SET LOCAL scope means these only apply to the current transaction.
The retry loop in the migration runner should handle the resulting timeout errors by waiting briefly (with jitter) and retrying. Most production migration tools have this built in; if yours does not, write the retry loop before you need it.
What ALTER TABLE does not do
Several common assumptions are wrong. ALTER TABLE does not always invalidate cached prepared statements; sometimes the plan keeps using stale catalog state until the connection is reset. ALTER TABLE does not always cascade through partitioned tables uniformly; some operations propagate to all partitions and others must be applied per partition. ALTER TABLE does not update the planner statistics; an ANALYZE is often needed after a significant schema change. ALTER TABLE does not unlock until COMMIT, so wrapping multiple ALTERs in a single transaction holds the strictest lock for the duration of the slowest operation.
The migration runner playbook
A production-grade migration system should: enforce lock_timeout and statement_timeout on every migration, refuse to run rewriting ALTER forms on tables above a configurable row threshold (or require explicit opt-in), log the start and end of every migration to a versioned migrations table for audit and rollback, use CREATE INDEX CONCURRENTLY for all index creation rather than CREATE INDEX, use the NOT VALID / VALIDATE pattern for all new constraints, and run the migration against a copy of production-scale data before deployment to catch unexpected lock duration.
Most teams discover these patterns through painful experience. The patterns themselves are not novel and are documented in the Postgres documentation, but the connection between the documentation and the operational consequences is only obvious after the first ALTER TABLE outage. Our four products run SQLite, where the equivalent considerations are different (SQLite ALTER TABLE has historically been limited to a small set of operations, with the broader set arriving in 3.35+, and the table-rewrite cost is bounded by the page-locking model rather than connection-level locking). When we migrate any of them to Postgres, the migration runner will enforce these patterns from day one.
The deeper observation: schema migrations are one of the few places where Postgres behaves dramatically differently in development than in production. A migration that takes 10 milliseconds on an empty laptop table takes hours on a billion-row production table and may produce an outage. The patterns above turn most of these dangerous operations into safe ones, but only if the team has internalized the patterns before the production load reaches the threshold where the safe defaults stop being safe.
Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) keep the lights on.