Postgres pg_replication_slots: Why Streaming and Logical Slots Need Different Monitoring

Replication slots are one of the most operationally dangerous Postgres features because their failure mode is silent disk fill. The view exposes the data to monitor, but the alerts require different thresholds for streaming and logical slots.

Postgres replication slots are a primitive that solves a specific problem (preventing the primary from recycling WAL segments that a replica or logical consumer still needs) and creates a specific failure mode (silent WAL accumulation when the consumer disappears, eventually filling the disk and producing a complete outage of the primary). The pg_replication_slots view exposes the state of every slot, and the data it carries is enough to catch the failure mode before it becomes an incident. The problem is that streaming-replication slots and logical-replication slots have different operational characteristics and require different alert thresholds; the single-threshold approach that most teams start with either alerts too often on healthy logical slots or alerts too late on broken streaming slots.

What the view exposes

The columns that matter for monitoring are slot_name (the identifier used to manage the slot), slot_type (physical for streaming replication, logical for logical replication), active (whether a consumer is currently connected), active_pid (the backend serving the consumer if active), xmin and catalog_xmin (the oldest transaction IDs the slot is holding open), restart_lsn (the WAL position the slot has not yet released), and confirmed_flush_lsn (for logical slots, the position the consumer has acknowledged). Postgres 13 added wal_status (a categorical health indicator: reserved, extended, unreserved, lost) and safe_wal_size (the bytes of WAL the slot can still hold before reaching the configured limit).

The single most useful derived value is the WAL bytes the slot is currently holding, calculated as pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn). This is the metric that should drive alerts, because it directly measures the disk-fill risk the slot represents. The query for everyday monitoring is straightforward:

SELECT slot_name, slot_type, active, wal_status,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained,
  pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

Why streaming and logical slots need different thresholds

Streaming-replication slots are typically read by hot-standby replicas that consume WAL continuously and lag the primary by milliseconds to seconds in normal operation. A streaming slot retaining even 1 GB of WAL means the standby is substantially behind or has disconnected, and the standby should be investigated promptly. As alert thresholds, 5 GB is a conservative warning level for streaming slots, 25 GB calls for urgent intervention, and 50 GB is approaching the catastrophic-failure region where the disk-fill clock is running.

Logical-replication slots have different operational characteristics. Logical decoding produces an event stream from WAL, which is consumed by an application that may be running batch processing, applying changes through complex transformations, or living behind a queue that buffers consumption. A logical slot showing 1 GB of retained WAL is not necessarily abnormal; the consumer may be processing a large transaction or recovering from a restart. The right thresholds are higher: 50 GB is the right warning level, 100 GB is urgent, and 200 GB is approaching disk fill. The absolute byte values depend on the disk size, but logical thresholds sit well above streaming ones: 10x at the warning level and 4x at the urgent level in the numbers above.

The active column changes the alert logic significantly. An inactive streaming slot is a problem at any nonzero retained WAL size, because the standby is by definition not consuming. An inactive logical slot may be in a brief reconnect window where the consumer is restarting, and the right behavior is to wait a few minutes before alerting. That grace period depends on how long the slot has been inactive, which is easiest to track in the alerting layer. The combined query that applies the type-specific thresholds:

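-- An output alias cannot be referenced in the same SELECT, so compute retained_bytes in a CTE.
-- The bigint casts keep the GB arithmetic from overflowing 32-bit integer math.
WITH slots AS (
  SELECT slot_name, slot_type, active, wal_status,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
  FROM pg_replication_slots
)
SELECT slot_name, slot_type, active, retained_bytes,
  CASE
    WHEN wal_status = 'lost' THEN 'CRITICAL'
    WHEN slot_type = 'physical' AND retained_bytes > 25::bigint * 1024 * 1024 * 1024 THEN 'CRITICAL'
    WHEN slot_type = 'physical' AND retained_bytes > 5::bigint * 1024 * 1024 * 1024 THEN 'WARNING'
    WHEN slot_type = 'physical' AND NOT active AND retained_bytes > 0 THEN 'WARNING'
    WHEN slot_type = 'logical' AND retained_bytes > 100::bigint * 1024 * 1024 * 1024 THEN 'CRITICAL'
    WHEN slot_type = 'logical' AND retained_bytes > 50::bigint * 1024 * 1024 * 1024 THEN 'WARNING'
    ELSE 'OK'
  END AS alert_level
FROM slots
WHERE wal_status != 'reserved'
   OR retained_bytes > 5::bigint * 1024 * 1024 * 1024
   OR (slot_type = 'physical' AND NOT active AND retained_bytes > 0);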
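
The reconnect grace period for inactive logical slots is awkward to express in a single query, because it depends on how long the slot has been inactive. A minimal Python sketch of that logic in the alerting layer follows. The SlotSample type, the inactive_seconds field (assumed to be tracked by whatever samples the view), and the 300-second grace window are all illustrative assumptions, not part of Postgres.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

# (warning, critical) thresholds in bytes, taken from the text above.
THRESHOLDS = {
    "physical": (5 * GIB, 25 * GIB),
    "logical": (50 * GIB, 100 * GIB),
}

# Assumption: how long an inactive logical slot may stay quiet before alerting.
LOGICAL_RECONNECT_GRACE_SECONDS = 300


@dataclass
class SlotSample:
    slot_type: str           # 'physical' or 'logical'
    active: bool
    retained_bytes: int
    inactive_seconds: float  # hypothetical: maintained by the sampler, 0 when active


def alert_level(s: SlotSample) -> str:
    warning, critical = THRESHOLDS[s.slot_type]
    # Size thresholds apply regardless of whether a consumer is connected.
    if s.retained_bytes > critical:
        return "CRITICAL"
    if s.retained_bytes > warning:
        return "WARNING"
    if not s.active and s.retained_bytes > 0:
        # An inactive streaming slot is a problem at any nonzero size.
        if s.slot_type == "physical":
            return "WARNING"
        # An inactive logical slot gets a reconnect window first.
        if s.inactive_seconds > LOGICAL_RECONNECT_GRACE_SECONDS:
            return "WARNING"
    return "OK"
```

With these assumptions, a logical slot that has been inactive for one minute with 1 GB retained stays at OK, while the same slot after ten minutes of inactivity raises a WARNING.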

The max_slot_wal_keep_size safety net

Postgres 13 added max_slot_wal_keep_size, which converts the catastrophic slot-fills-disk scenario into a contained slot-becomes-lost scenario. The default is -1, which keeps the old unlimited behavior, so the safety net exists only once a limit is set. When a slot exceeds the limit, Postgres stops retaining additional WAL for it: wal_status first moves to unreserved, and once a checkpoint removes WAL the slot still needed, the slot becomes lost while the primary continues normal operation. The consumer that was reading the slot will fail when it tries to resume from it, but the database stays up. The right value depends on the disk size and the operational requirement; 50 GB to 200 GB is typical for production deployments. The setting takes effect on a configuration reload and does not require a restart, so it can be tightened during an incident.

The trade-off is between preserving consumer state and protecting the primary. If a streaming standby's slot is lost, the standby cannot catch up via WAL replay and must be reseeded from a base backup. If a logical consumer's slot is lost, the consumer must restart from a fresh snapshot, which may not be straightforward for some applications. The max_slot_wal_keep_size setting prioritizes the primary's availability over the consumer's continuity, which is usually the right trade-off but not always.
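
One way to act on wal_status and safe_wal_size before a slot is lost is to map them to a severity in the alerting layer. The sketch below is illustrative only: the severity labels, the assess_slot helper, and the 10 GiB low-headroom cutoff are assumptions, while the four wal_status values are the ones the view reports.

```python
from typing import Optional, Tuple

GIB = 1024 ** 3

# Assumption: warn when less than this much WAL can be written before the
# slot is in danger of being lost.
LOW_HEADROOM_BYTES = 10 * GIB

# Severity labels and notes are illustrative, not Postgres terminology.
STATUS_TABLE = {
    "reserved": ("OK", "retained WAL is within max_wal_size"),
    "extended": ("WATCH", "past max_wal_size, WAL still retained"),
    "unreserved": ("CRITICAL", "required WAL may be removed at the next checkpoint"),
    "lost": ("LOST", "required WAL is gone; reseed or resnapshot the consumer"),
}


def assess_slot(wal_status: Optional[str],
                safe_wal_size: Optional[int]) -> Tuple[str, str]:
    """Map one slot's wal_status and safe_wal_size to (severity, note)."""
    if wal_status is None:
        # wal_status is NULL when the slot has no restart_lsn.
        return ("OK", "slot has not reserved any WAL")
    severity, note = STATUS_TABLE.get(
        wal_status, ("UNKNOWN", "unrecognized wal_status"))
    # safe_wal_size is NULL when no limit is configured or the slot is lost.
    if (severity in ("OK", "WATCH")
            and safe_wal_size is not None
            and safe_wal_size < LOW_HEADROOM_BYTES):
        return ("WARNING", "little headroom left before the slot limit")
    return (severity, note)
```

The point of the headroom check is to raise an alert while the slot can still be saved, since unreserved and lost arrive only after the limit has already been crossed.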

The xmin and catalog_xmin trap

Replication slots hold open more than just WAL. The xmin column shows the oldest transaction ID the slot prevents from being vacuumed; this directly impacts dead-tuple cleanup across the cluster. A slot with a stuck xmin produces table bloat that vacuum cannot reclaim, which compounds over days and weeks into significant disk waste and query performance regression. The catalog_xmin is the equivalent for system catalog cleanup and primarily affects DDL-heavy workloads.

Monitoring slot xmin requires comparing against the current transaction ID and alerting when the gap grows large. The query:

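SELECT slot_name, slot_type,
  age(xmin) AS xmin_age,
  age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
-- Logical slots usually report only catalog_xmin, so filter on either column.
WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
ORDER BY GREATEST(age(xmin), age(catalog_xmin)) DESC;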

The threshold for alerting is workload-dependent. An xmin age of 100 million on a low-write database is fine; the same age on a high-write database means substantial bloat is accumulating. The right pattern is to baseline the value during normal operation and alert on deviation.
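
Baselining and alerting on deviation can be sketched as a comparison against the median of recent samples. The function name, the 3x factor, and the minimum sample count below are assumptions chosen for illustration; real values should come from the workload.

```python
from statistics import median
from typing import Sequence


def xmin_age_deviates(history: Sequence[int], current: int,
                      factor: float = 3.0, min_samples: int = 10) -> bool:
    """Return True when the current xmin age exceeds the baseline by `factor`.

    `history` holds xmin ages sampled during normal operation.
    """
    if len(history) < min_samples:
        # Not enough data to form a baseline; stay quiet rather than guess.
        return False
    # Floor the baseline at 1 so an all-zero history still gives a usable threshold.
    baseline = max(median(history), 1)
    return current > factor * baseline
```

A median is used here rather than a mean so that one earlier spike in the history does not inflate the baseline and hide the next one.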

What the view does not show

The view shows current state but not history. A slot that recently caught up after lagging shows healthy current state, but the lag spike may have been the warning that the consumer is undersized. The right pattern is to sample pg_replication_slots at one-minute intervals into a metrics store and alert on both current state and rate of change.

The view does not show why a slot is behind. A logical slot with 50 GB of retained WAL could reflect a slow consumer or a sudden write burst that the consumer is still working through. The way to distinguish them is to compare how fast confirmed_flush_lsn advances against how fast the primary generates WAL, which shows up as the trend in retained bytes across samples: a consumer keeping pace with WAL generation has flat retained bytes, a consumer that is catching up has decreasing retained bytes, and a consumer that is falling behind has increasing retained bytes.
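
The three trends can be told apart from two samples of retained bytes taken a known interval apart. This is a rough sketch: the function names and the 1 MiB/s tolerance are assumptions, and a production version would likely smooth over more than two samples.

```python
from typing import Optional

MIB = 1024 ** 2


def classify_trend(retained_then: int, retained_now: int,
                   interval_seconds: float,
                   tolerance_bytes_per_sec: float = 1 * MIB) -> str:
    """Classify a slot from two retained-WAL samples taken interval_seconds apart."""
    rate = (retained_now - retained_then) / interval_seconds
    if rate > tolerance_bytes_per_sec:
        return "falling behind"
    if rate < -tolerance_bytes_per_sec:
        return "catching up"
    return "keeping pace"


def seconds_until_limit(retained_now: int, rate_bytes_per_sec: float,
                        limit_bytes: int) -> Optional[float]:
    """Estimate time until retained WAL reaches limit_bytes at the current rate."""
    if retained_now >= limit_bytes:
        return 0.0
    if rate_bytes_per_sec <= 0:
        # Flat or shrinking: the limit is not being approached.
        return None
    return (limit_bytes - retained_now) / rate_bytes_per_sec
```

Feeding seconds_until_limit the critical threshold or the max_slot_wal_keep_size value turns a raw byte count into a time-to-incident estimate, which is often easier to prioritize.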

The view does not show consumer-side problems. A logical consumer that is processing events but applying them slowly downstream may appear healthy to the slot view while producing customer-visible lag. The right pattern is to monitor both the slot side (retained WAL) and the application side (last-applied event timestamp) and treat whichever signal is worse as the binding constraint.

Our SQLite-baseline products and Postgres migration plan

DocuMint, CronPing, FlagBit, and WebhookVault all run on SQLite, which has no equivalent slot mechanism. The closest analogs are the WAL checkpoint and the Litestream sidecar's last-replicated position. We monitor both, though their blast radius is smaller than that of Postgres slots.

When we migrate to Postgres (planned for FlagBit first as the product with the largest write rate and most complex schema), pg_replication_slots monitoring will be part of the launch observability investment. The plan is to start with conservative thresholds (5 GB warning, 25 GB critical for streaming slots; 50 GB warning, 100 GB critical for logical slots) and tighten as the workload pattern stabilizes. The max_slot_wal_keep_size will be set to 200 GB initially, which provides ample headroom while ensuring the primary cannot be taken down by a forgotten slot, provided the WAL volume is sized with more than that much free space.

The deeper observation is that replication slots are one of the few Postgres features where the operational discipline is more important than the technical detail. The mechanism is straightforward, the monitoring queries are not difficult, and the failure modes are well-documented. Teams that have slot-related incidents have them because nobody was watching the metrics, not because the metrics were unavailable. The right pattern is to set up monitoring before any slots exist, so that the first slot created automatically has alerts attached, and to review slot existence quarterly to catch slots that were created during incident response and never cleaned up. The slot is a load-bearing primitive for several Postgres replication topologies, and the small investment in monitoring pays back the first time it catches a disappeared consumer before the disk fills.

