Postgres Replication Slots: How to Prevent the Disk-Fill Disaster Every Team Eventually Has

A replication slot tells the primary to keep WAL files until a downstream consumer has read them. If the consumer goes away and nobody notices, the WAL pile grows until the disk fills. It happens to teams that should know better.

The Postgres replication slot is one of those features that solves a real problem at the cost of introducing a new and worse one. The problem it solves is replication catch-up: if a standby goes offline briefly, the primary needs to know not to recycle the WAL files the standby will need when it comes back. Without a slot, the primary will eventually overwrite the WAL files the standby needs, and the standby will have to be reseeded from a base backup. With a slot, the primary keeps the WAL files indefinitely, and the standby can catch up whenever it returns.

The new problem is that "indefinitely" is a long time. If the standby never comes back, or if a logical replication subscriber crashes and is forgotten, the primary keeps WAL files forever, growing the pg_wal directory until the disk fills and the database stops accepting writes. This is the disk-fill disaster, and it happens to teams that should know better, because the failure mode is silent for weeks and catastrophic in minutes.

What a replication slot does

A replication slot is a small metadata record stored on the primary that names a downstream consumer and tracks the WAL position that consumer has confirmed reading up to. There are two kinds: physical slots, used by streaming replication standbys, and logical slots, used by logical replication subscribers and other consumers (Debezium, custom CDC tooling, pglogical, and so on). Operationally, both kinds work the same way: the primary will not remove or recycle any WAL the slot still needs, which is everything from the slot's restart_lsn onward.

The slot persists across primary restarts, which is the feature. It does not expire automatically when the consumer disappears, which is the bug shaped like a feature. If you create a slot for a standby, and the standby is replaced with a different machine that creates a new slot, the old slot sits there forever, holding WAL until somebody manually drops it.

The disk-fill failure mode

The chronology of the disaster is predictable. Day zero: someone creates a logical replication slot for a CDC pipeline. Day thirty: the pipeline is decommissioned but the slot is forgotten. Day sixty: the WAL pile has grown from a few hundred MB to several GB, but the disk usage charts show only slow growth, and nobody notices. Day ninety: the WAL pile is hundreds of GB and somebody asks why the database disk is filling up. Day ninety-one: nobody acts. Day ninety-five: the disk fills, Postgres can no longer write WAL, and the database stops accepting commits.

The recovery is painful. Once the disk is full, you cannot drop the slot through SQL (the primary is down or unresponsive), you cannot safely delete WAL files by hand (Postgres needs them for crash recovery), and you cannot easily add disk capacity to a running database. The standard recovery is to shut Postgres down, remove the abandoned slot's directory from pg_replslot/, and restart; with the slot gone, the next checkpoint can recycle the WAL it was holding. The downtime is measured in hours. The data is not at risk if you are careful, but if you remove the wrong slot's directory, the primary may recycle WAL that a working standby still needs, and that standby will have to be reseeded.

The configuration knob that exists in Postgres 13+

Postgres 13 added max_slot_wal_keep_size, which sets a ceiling on how much WAL the primary will retain on behalf of a slot. When the ceiling is exceeded, the primary marks the slot as "lost" and starts recycling WAL anyway. The slot becomes useless (the consumer will have to be reseeded) but the database stays alive.
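Whether a slot has hit the ceiling is visible in the wal_status column of pg_replication_slots (Postgres 13 and later), which reports reserved, extended, unreserved, or lost. As a minimal sketch, a monitoring job could map each status to an operator action along these lines. The helper name and the action strings are illustrative assumptions, not anything Postgres defines.

```python
# Hypothetical helper: map pg_replication_slots.wal_status to an operator action.
# The four status names are the ones Postgres reports; the actions are assumptions.
WAL_STATUS_ACTIONS = {
    "reserved": "ok",            # retained WAL is within max_wal_size
    "extended": "watch",         # over max_wal_size, but the WAL is still being kept
    "unreserved": "page",        # needed WAL may be removed at the next checkpoint
    "lost": "reseed-consumer",   # required WAL is gone; the slot is unusable
}


def action_for(wal_status):
    """Return the action for a wal_status value.

    wal_status is NULL (None here) when the slot has no restart_lsn yet,
    which means it is not holding any WAL.
    """
    if wal_status is None:
        return "ok"
    # An unknown status raises KeyError so that it is never silently ignored.
    return WAL_STATUS_ACTIONS[wal_status]


if __name__ == "__main__":
    for status in ("reserved", "extended", "unreserved", "lost"):
        print(status, "->", action_for(status))
```

Paging on unreserved rather than lost is a deliberate choice in this sketch: unreserved is the last state from which the slot can still recover if the consumer reconnects in time.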

Most teams should set this to a value comfortably smaller than the disk size and treat it as a safety net rather than a primary control. Common values are 50-200 GB depending on how much WAL the workload produces per hour; the sizing target is "enough WAL to cover a multi-hour outage of a downstream consumer, but not enough to fill the disk if the consumer is forgotten." The default of -1 (unlimited) is the wrong default for production.

The companion knob is wal_keep_size (called wal_keep_segments before Postgres 13), which is the minimum amount of WAL the primary retains for standbys whether or not they use a slot. This one should be small (a few hundred MB) because the primary keeps that much WAL at all times, even when no standby needs it. Setting both knobs together gives you a floor and a ceiling on WAL retention.
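The sizing rule for the ceiling is simple arithmetic, so it can be sketched in a few lines. The function below is a hypothetical helper, and the 50 percent disk cap is an assumption chosen for illustration; substitute a WAL rate measured on your own workload.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def suggest_max_slot_wal_keep_size(wal_bytes_per_hour, outage_hours, disk_bytes,
                                   max_disk_percent=50):
    """Return a candidate max_slot_wal_keep_size value, e.g. '81920MB'.

    The target is enough WAL to cover the outage window, capped at
    max_disk_percent of the disk. The 50 percent default is an assumption
    of this sketch, not a Postgres rule.
    """
    wanted = wal_bytes_per_hour * outage_hours
    cap = disk_bytes * max_disk_percent // 100
    # Postgres accepts memory units such as MB on this setting.
    return f"{min(wanted, cap) // MIB}MB"


if __name__ == "__main__":
    # 10 GiB of WAL per hour, 8 hour outage budget, 500 GiB disk.
    print(suggest_max_slot_wal_keep_size(10 * GIB, 8, 500 * GIB))  # 81920MB
    # A much busier workload hits the disk cap instead.
    print(suggest_max_slot_wal_keep_size(50 * GIB, 8, 500 * GIB))  # 256000MB
```

When the cap wins, as in the second call, the disk is too small to cover the outage window you asked for, and that is worth knowing before the outage rather than during it.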

The monitoring that catches the disaster early

The query that catches a slot disaster before it becomes a disaster:

    SELECT slot_name, slot_type, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
           age(xmin) AS xmin_age,
           age(catalog_xmin) AS catalog_xmin_age
    FROM pg_replication_slots
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

The retained_wal column is the size of the WAL the slot is currently holding back. The active column indicates whether a consumer is currently connected; false with significant retained WAL is the danger signal. The two age columns show how many transactions back the slot is holding the vacuum horizon. Logical slots set catalog_xmin, which holds back cleanup of the system catalogs; physical slots set xmin when the standby runs with hot_standby_feedback, which holds back cleanup on every table. Either one is the secondary disaster mode, where slots cause bloat in addition to disk fill.
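The retained_wal figure is just the distance in bytes between two LSNs. An LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position, so the subtraction that pg_wal_lsn_diff performs can be reproduced client-side. A minimal sketch, with hypothetical function names:

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL between a slot's restart_lsn and the current WAL position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


if __name__ == "__main__":
    print(retained_wal_bytes("16/B374D848", "16/B374D000"))  # 2120
    # The low half wraps into the high half at 2**32.
    print(retained_wal_bytes("1/0", "0/FFFFFFFF"))  # 1
```

This is useful when a metrics pipeline already collects raw LSNs and you would rather compute the gap there than run another query against the primary.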

The alert thresholds: at 5 GB of retained WAL on a slot, log a warning. At 25 GB or 10 percent of disk, whichever comes first, page somebody. At 50 GB, alert as critical. The exact numbers depend on workload, but the principle is that you want to know about the problem days before it becomes a disaster, not hours before.
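Those thresholds translate directly into a small classifier that a monitoring job could run against each slot's retained WAL. This is a sketch under the numbers given above; the function name and the severity labels are assumptions.

```python
GIB = 1024 ** 3


def classify_slot(retained_bytes, disk_bytes):
    """Map a slot's retained WAL to a severity using the thresholds above."""
    if retained_bytes >= 50 * GIB:
        return "critical"
    # 25 GB, or 10 percent of the disk, whichever is reached first.
    if retained_bytes >= 25 * GIB or retained_bytes * 10 >= disk_bytes:
        return "page"
    if retained_bytes >= 5 * GIB:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify_slot(1 * GIB, 1000 * GIB))   # ok
    print(classify_slot(5 * GIB, 1000 * GIB))   # warning
    print(classify_slot(12 * GIB, 100 * GIB))   # page (over 10 percent of disk)
    print(classify_slot(50 * GIB, 1000 * GIB))  # critical
```

On a small disk the 10 percent rule can fire before the 5 GB warning does, which is the intended behavior: the percentage check exists so that fixed thresholds do not hide a problem on a small volume.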

The lifecycle discipline

The behavioral fix is more reliable than the configuration fix. The discipline: every slot creation goes through a script or process that records why the slot exists, who owns it, and what the cleanup criterion is. Slots are tagged with metadata when possible (Postgres does not natively support slot tags, but a separate audit table works) and reviewed quarterly. Slots that have been inactive for more than seven days trigger an alert that requires explicit acknowledgment.
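The audit record and the seven-day rule can be sketched as follows. Everything here is hypothetical: the field names, the example slots, and the idea that a periodic job updates last_seen_active whenever it observes the slot with active set to true.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SlotRecord:
    """One row of a hypothetical slot audit table kept outside Postgres."""
    slot_name: str
    owner: str
    purpose: str
    cleanup_criterion: str
    last_seen_active: datetime  # updated by a periodic check of the active column


def stale_slots(records, now, max_idle=timedelta(days=7)):
    """Return the names of slots that have been inactive for longer than max_idle."""
    return sorted(r.slot_name for r in records
                  if now - r.last_seen_active > max_idle)


if __name__ == "__main__":
    now = datetime(2024, 3, 15)
    records = [
        SlotRecord("cdc_orders", "data-eng", "CDC pipeline",
                   "drop when the pipeline is retired", datetime(2024, 3, 1)),
        SlotRecord("standby_b", "dba", "streaming standby",
                   "drop when the standby is replaced", datetime(2024, 3, 14)),
    ]
    print(stale_slots(records, now))  # ['cdc_orders']
```

Keeping the owner and cleanup criterion next to the staleness check means the alert can say who to ask and what the exit condition was, which is most of what gets lost when a pipeline is decommissioned.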

The temporary-slot variant (CREATE_REPLICATION_SLOT ... TEMPORARY) is the right tool for short-lived consumers. Temporary slots are dropped automatically when the connection that created them ends, so a crashed consumer cleans up after itself. Use temporary slots for ad-hoc CDC work, backfills, and anything where the consumer is not a long-running production component. Use permanent slots only for replication standbys and production CDC pipelines, and treat the permanent slot list as a critical resource that gets the same review attention as production database accounts.

Across our four products

We run DocuMint, CronPing, FlagBit, and WebhookVault on SQLite. SQLite has no replication slot equivalent because it has no replication; the operational model is single-instance with periodic backups, which avoids this class of problem entirely. When any product migrates to Postgres for its next scaling step, replication slot management becomes load-bearing operational infrastructure that needs to be in place before the first slot is created, not retrofitted after the first disaster.

The deeper observation is that database features that work indefinitely in the happy path are usually the ones that fail catastrophically in the unhappy path. Replication slots, prepared (two-phase commit) transactions, long-running idle-in-transaction sessions, and uncommitted savepoints all share this pattern: they consume resources to provide a guarantee, the guarantee is rarely used, and the resource consumption is invisible until it is not. The operational discipline is to treat each of these features as having an expiry date by default, with explicit renewal required, rather than treating them as default-permanent infrastructure. Postgres 13's max_slot_wal_keep_size made the discipline easier to enforce at the database layer, but the behavioral version was the actual fix all along.
