Postgres pg_stat_replication: Monitoring Streaming Replication Lag Before It Bites You

Skill levelIntermediatePostgres version10+Time10 min

Streaming replication is running. Your replica is configured. You think you're covered.

Then a failover happens and you find out the replica was four minutes behind. Four minutes of writes, gone.

The view that would have told you is pg_stat_replication. It has been there since Postgres 9.1. Most people do not look at it until something has already gone wrong.

The View

On the primary, run this:

SELECT * FROM pg_stat_replication;

You get one row per connected standby. The columns that matter:

client_addr: IP of the standby. If NULL, it is a local connection.
state: streaming is normal. catchup means the replica is behind and actively catching up. startup or backup indicates initial sync.
sent_lsn: How far the primary has sent WAL to this standby.
write_lsn: How far the standby has written WAL to its own disk.
flush_lsn: How far the standby has flushed to disk and is crash-safe.
replay_lsn: How far the standby has replayed WAL into its data files. This is what the replica can actually serve.
sync_state: async or sync. Synchronous standbys wait for flush acknowledgment before the primary returns commit success to the client.

Calculating Lag

LSN values are 64-bit integers representing byte positions in the WAL stream. Subtract them to get bytes of lag:

SELECT
  client_addr,
  state,
  sync_state,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

Postgres 10 added write_lag, flush_lag, and replay_lag interval columns directly to the view. These are wall-clock measurements at each replication stage, not byte counts. Use them when available:

SELECT
  client_addr,
  state,
  sync_state,
  write_lag,
  flush_lag,
  replay_lag,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication
ORDER BY lag_bytes DESC;

The byte count tells you how much work remains. The interval columns tell you how long it took to get to the current position. Both are useful for different things: bytes for capacity planning, intervals for alerting.

When to Alert

For async replication, the right threshold depends on your recovery time objective. A reasonable starting point:

Warn if replay_lag > 30 seconds
Page if replay_lag > 5 minutes
Page immediately if the standby disappears from the view entirely

A standby that disappears from pg_stat_replication means the WAL sender on the primary lost the connection. Check the primary's PostgreSQL log for the disconnect reason. Common causes: network partition, replica OOM kill, replica restart.

For sync replication, lag in this view maps directly to your write latency. If you are seeing flush_lag above a few hundred milliseconds on a synchronous standby, the standby is under pressure — disk, CPU, or network. That merits investigation before it becomes a write-path bottleneck for your application.

Synchronous vs Async: Different Guarantees, Different Risks

Asynchronous replication (the default) means the primary sends WAL without waiting for the standby to confirm receipt. The primary commits are fast, but the standby can fall arbitrarily far behind. If the primary fails before the standby has replayed all WAL, the data in the gap is lost.

Synchronous replication waits for at least one standby to flush WAL before returning commit success. This eliminates the data loss window but adds round-trip latency to every commit. The standby's flush_lag in pg_stat_replication is the tax you pay.

If you have synchronous_standby_names configured and the named standby falls behind or disconnects, your writes will stall. The synchronous_commit = remote_apply setting adds an additional wait for the standby to replay, not just flush — which increases the guarantee but also the latency.

What It Does Not Show

pg_stat_replication only covers physical streaming replication. It tells you nothing about:

Logical replication slots: Use pg_replication_slots for those. Logical slots can accumulate WAL indefinitely if the subscriber disconnects — a disk space hazard that physical replication does not have, since physical standbys have their own wal_keep_size limit.
Disk I/O on the replica: A replica applying WAL fast enough to keep up in bytes might still be thrashing disk. Check pg_stat_io on the replica itself (Postgres 16+) or pg_stat_bgwriter on earlier versions.
Query performance on the replica: Hot standby conflicts — where a query on the replica is cancelled to allow WAL replay to proceed — appear in pg_stat_database_conflicts, not here.
Whether the replica's data is consistent enough to use for reads: That is a separate question about hot standby configuration and recovery state, not answered by this view.

Diagnostic Patterns

A standby stuck in state = catchup with growing replay_lag is not keeping up. Causes to check: replica under CPU pressure from concurrent read queries, disk I/O saturation on the replica from WAL replay competing with queries, network throughput limit between primary and replica.

If sent_lsn equals replay_lsn and all lag columns are zero, the replica is current. That is the state you want to confirm before a planned failover. Do not failover to a standby that shows nonzero replay_lag without accepting the data loss that represents.

A good monitoring query to run on a cron or from your observability stack:

SELECT
  client_addr,
  state,
  sync_state,
  replay_lag,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
  CASE WHEN replay_lag IS NULL THEN 'no_lag_data'
       WHEN replay_lag > interval '5 minutes' THEN 'critical'
       WHEN replay_lag > interval '30 seconds' THEN 'warning'
       ELSE 'ok' END AS lag_status
FROM pg_stat_replication;

If the view returns zero rows and you expect standbys, the WAL sender is not connected. That is a different problem than lag — it means no replication is happening at all.

Building an indie SaaS and want to track your progress publicly? builds.anethoth.com is a directory for founders building in the open — real metrics, transparent stages.