Postgres pg_stat_subscription_stats: Aggregate Health Metrics for Logical Replication

Postgres version15+Viewpg_stat_subscription_statsDistinct frompg_stat_subscription (per-worker real-time state)Functionpg_stat_reset_subscription_stats(subid)

Logical replication on Postgres 15+ gives you two monitoring surfaces. pg_stat_subscription shows you real-time per-worker state: what LSN each worker has received, whether it's running, how long it's been idle. That's in the cycle 323 post. This view — pg_stat_subscription_stats — is different. It accumulates error counts across the lifetime of a subscription, giving you the aggregate health picture that the per-worker view can't.

Building in public at builds.anethoth.com — proof that a product is really being built.

The columns that matter

SELECT
  subid,
  subname,
  apply_error_count,
  sync_error_count,
  stats_reset
FROM pg_stat_subscription_stats
ORDER BY apply_error_count DESC;

subid / subname: subscription identifier and name. One row per subscription, regardless of how many worker processes it has.
apply_error_count: cumulative count of errors during the apply phase — when received WAL changes are being written to the subscriber. Errors here usually mean schema mismatch (column added or dropped on publisher without refreshing subscription), permission problems (subscriber user lacks INSERT/UPDATE/DELETE on the target table), or type incompatibility.
sync_error_count: cumulative count of errors during initial table synchronization — the phase where each table gets its baseline data copy. Errors here usually mean a large table timed out during the copy phase, a network partition interrupted the COPY stream, or the publisher's replication slot was invalidated while sync was in progress.
stats_reset: timestamp of the last time counters were manually reset with pg_stat_reset_subscription_stats(). Essential context for interpreting counts — a count of 50 errors is very different if it happened over 6 months versus 6 hours.

What increasing apply_error_count tells you

An apply_error_count that was stable for weeks and then starts climbing is a schema mismatch waiting to be diagnosed. The most common cause: someone ran ALTER TABLE on the publisher without adding the column on the subscriber first (or vice versa). The apply worker retries, fails, logs an error, increments the count, and tries again — it doesn't stop, but it also doesn't make progress on that table.

-- Watch apply_error_count over time
SELECT
  subname,
  apply_error_count,
  now() - stats_reset AS since_reset
FROM pg_stat_subscription_stats
WHERE apply_error_count > 0;

Permission problems are the other common cause. If someone revoked SELECT or DML rights from the replication user on a specific table, apply errors for that table accumulate. The error messages in the subscriber's PostgreSQL log tell you which table — pg_stat_subscription_stats only tells you that something is wrong, not what.

What increasing sync_error_count tells you

sync_error_count climbing is almost always about scale or timing. A table with 500 million rows takes hours to synchronize. If the network drops during that window, the sync fails, increments the counter, and restarts from scratch. If your wal_sender_timeout or wal_receiver_timeout is shorter than your largest table's sync duration, you'll see sync errors accumulate every restart.

The fix is usually one of three things: set wal_sender_timeout = 0 on the publisher during initial sync, break the large table into smaller partitions before replicating, or pre-copy the large table via pg_dump/restore and start the subscription with copy_data = false for that table specifically.

Resetting counters after a fix

-- After fixing the schema mismatch or permission issue:
SELECT pg_stat_reset_subscription_stats(subid)
FROM pg_subscription
WHERE subname = 'my_subscription';

Running this after a confirmed fix gives you a clean baseline. Future error counts will tell you if the fix actually held. Without a reset, you're looking at the history of all past problems mixed with any new ones — not useful for monitoring.

What this view does NOT show

This is important: pg_stat_subscription_stats doesn't tell you:

Which specific table caused the error
The error message (that's in the subscriber's server log)
Individual worker state or per-worker lag (that's pg_stat_subscription)
Publisher-side slot growth or WAL accumulation (that's pg_replication_slots on the publisher)
The retry timing or backoff behavior
Whether errors are clustered on a specific table or distributed across all of them

Combined monitoring pattern

The two views complement each other. Use pg_stat_subscription for real-time state — is replication running, how far behind is it, is the apply worker active. Use pg_stat_subscription_stats for health trend — are errors accumulating, is a sync failing repeatedly.

-- Combined view: real-time state + aggregate health
SELECT
  s.subname,
  s.pid IS NOT NULL AS is_running,
  s.last_msg_receipt_time,
  ss.apply_error_count,
  ss.sync_error_count,
  ss.stats_reset
FROM pg_stat_subscription s
JOIN pg_stat_subscription_stats ss ON s.subid = ss.subid
WHERE s.relid IS NULL  -- exclude per-table sync workers, show main apply worker only
ORDER BY s.subname;

Alert on apply_error_count increasing more than 5 per hour. Alert on sync_error_count increasing at all during a sync phase — even one sync error means the initial copy is failing and restarting. If you set a baseline after the last known-good state, any count above that baseline is signal worth investigating before it becomes replication lag you can't explain.

The view exists on every subscriber. You don't need superuser to read it. There's no reason not to be watching it.