Postgres pg_stat_subscription: Watching Logical Replication From the Subscriber Side

The view that tells you whether your logical replication subscriber is actually keeping up. Reading pg_stat_subscription is half the diagnostic story that pg_stat_replication on the primary cannot tell you alone.

Postgres logical replication is a two-sided system. The publisher side has pg_stat_replication, which tells you what each connected subscriber is doing from the publisher's perspective. The subscriber side has pg_stat_subscription, which tells you what each subscription is doing from the subscriber's perspective. The two views show different things, and a complete picture of a logical replication setup requires reading both. Most teams set up logical replication, monitor pg_stat_replication on the publisher, and miss the subscriber-side diagnostics that explain failures the publisher-side view cannot.

What pg_stat_subscription shows

The view has one row per subscription on the subscriber instance. The columns that matter operationally are pid (the worker process running the subscription, NULL if not currently running), received_lsn (the last LSN the subscriber has received from the publisher), last_msg_send_time and last_msg_receipt_time (publisher-side and subscriber-side timestamps of the last message, useful for measuring network and processing lag separately), latest_end_lsn (the last LSN whose apply has completed), and latest_end_time (when that apply completed).

The basic health check is whether pid is NULL when it should not be. A logical replication subscription that has crashed and not restarted will show NULL pid, and the subscriber will not catch up until the worker is restarted. The autovacuum-like behavior of the launcher process should normally restart workers, but if the worker is failing repeatedly (typically due to a conflict the subscriber cannot apply), the launcher will back off and the subscription will fall further behind.

The receive-vs-apply lag distinction

Two different lags can hurt a logical replication subscriber, and they have different remediation. Receive lag is the gap between what the publisher has produced and what the subscriber has received. Apply lag is the gap between what the subscriber has received and what it has applied to its own tables.

Receive lag points at network or publisher problems: slow link, publisher under load, replication slot configured but the subscriber connection broken. The diagnostic is comparing pg_stat_replication.sent_lsn on the publisher (what has been sent) with pg_stat_subscription.received_lsn on the subscriber (what has been received). A growing gap means the bytes are leaving the publisher but not arriving at the subscriber.

Apply lag points at subscriber-side problems: the apply worker is single-threaded by default, conflicts are blocking specific transactions, the subscriber is CPU- or I/O-bound on apply. The diagnostic is comparing pg_stat_subscription.received_lsn with pg_stat_subscription.latest_end_lsn. A growing gap means bytes are arriving but not being applied. Postgres 16 added parallel apply for streaming transactions, which helps the apply-bound case but does not fully solve the single-stream-of-changes problem.

The conflict problem

Logical replication conflicts are the failure mode most teams hit eventually. A row that exists on the subscriber but not the publisher, a unique constraint violation, a row that does not exist where an UPDATE expects it to: these all produce conflicts that stop the apply worker. The subscriber's pg_stat_subscription will show the worker disappearing and restarting and disappearing again. The Postgres log will show the specific error.

The historic fix was to manually resolve the conflict (delete the problematic row, advance the slot, restart) and re-enable replication. Postgres 15 added subscription parameters disable_on_error and skip_lsn to make this less manual: disable_on_error stops the subscription when a conflict is hit rather than retrying forever, and skip_lsn lets you tell the subscriber to skip past a specific transaction. Postgres 16 expanded the conflict-detection metadata available.

The deeper point is that logical replication assumes the subscriber schema can accept what the publisher sends. If the subscriber has additional constraints, additional triggers that fire on apply, or schema drift relative to the publisher, the apply will fail at the first conflict and stop until manually resolved. The discipline is to treat the subscriber schema as part of the replication contract and not modify it casually.

The everyday diagnostic queries

Three queries should be in any operator's toolbox.

First, basic subscription health on the subscriber:

SELECT subname, pid IS NOT NULL AS active,
       received_lsn, latest_end_lsn,
       pg_wal_lsn_diff(received_lsn, latest_end_lsn) AS apply_lag_bytes,
       last_msg_receipt_time, latest_end_time
FROM pg_stat_subscription;

Second, comparing receive lag against publisher state requires querying both sides. The simplest pattern is to run the same query on publisher (pg_stat_replication.sent_lsn) and subscriber (pg_stat_subscription.received_lsn) and compute the difference.

Third, the most-common-failure-mode query is whether the worker is stable or crash-looping:

SELECT subname, pid, last_msg_receipt_time,
       EXTRACT(EPOCH FROM (now() - last_msg_receipt_time)) AS seconds_since_last_msg
FROM pg_stat_subscription;

A subscription with NULL pid and last_msg_receipt_time more than a few minutes old is either crashed, in launcher backoff, or paused. The Postgres log will say which.

The replication slot interaction

Logical replication uses a replication slot on the publisher to track which WAL bytes the subscriber has consumed. This slot guarantees that WAL is retained until the subscriber confirms it. If the subscriber stops consuming (worker crash, network partition, deliberate pause), the slot will retain WAL on the publisher indefinitely, potentially filling the publisher's disk.

The publisher-side mitigation is max_slot_wal_keep_size, which limits how much WAL a slot can retain before the slot is invalidated and the subscriber must be reinitialized. The subscriber-side mitigation is monitoring pg_stat_subscription for stuck workers and alerting on them quickly so the publisher slot does not bloat unnoticed.

This is the most common way logical replication causes disk-fill incidents on the publisher: a subscriber goes offline for an extended period without anyone noticing, and the publisher's WAL grows to fill the disk. The single highest-leverage alert in a logical replication setup is an active-but-not-progressing subscription on either side.

What pg_stat_subscription does not show

The view does not show conflict details (those are in the log), does not show per-table state (which can matter when only some publication tables are causing problems), does not show parallel apply worker state for streaming transactions (added partially in Postgres 16 via pg_stat_subscription_stats), and does not show subscriber-side resource consumption (CPU, memory, I/O) that may be the actual bottleneck. A complete diagnostic requires combining pg_stat_subscription with pg_stat_activity (to see what the apply worker is currently doing), pg_locks (to see if the apply worker is blocked), and the Postgres log (for conflict details).

Our products and Postgres migration

For all four of our products (DocuMint, CronPing, FlagBit, WebhookVault) we are SQLite-based, with single-instance deployment and no replication. SQLite has no analog to logical replication; the closest thing is Litestream-style continuous backup to object storage, which is one-way and snapshot-based rather than transactional. When we eventually migrate one or more products to Postgres for workloads that warrant it, logical replication becomes one of the patterns we will use (for read replicas, for major-version upgrades, for separating analytics from transactional load), and pg_stat_subscription becomes one of the views we will monitor.

The deeper observation is that logical replication is genuinely useful but operationally subtle in ways that the documentation does not emphasize. The mental model needs to include both sides of the connection (publisher slot, subscriber worker), both kinds of lag (receive lag, apply lag), and the conflict-resolution discipline that turns a flaky subscriber into a stable one. The views pg_stat_replication on the publisher and pg_stat_subscription on the subscriber are the structured form of that mental model. Teams that monitor only one side eventually have an incident the other side would have predicted.


This essay is part of our ongoing series on operating production databases. Our products DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) all run on SQLite today, with eventual Postgres migrations planned for the workloads that warrant it.

Read more