Postgres pg_stat_database_conflicts: Diagnosing Standby Query Cancellations

Standby queries get cancelled because they conflict with WAL replay, and the only honest way to diagnose the problem is pg_stat_database_conflicts. The view exposes five conflict types and the trade-off matrix between query stability and replication lag.

Running read replicas in Postgres is one of the easier ways to scale read traffic, but the documentation does not prepare teams for one specific operational surprise: long-running queries on the standby get cancelled because they conflict with WAL replay. The cancellation looks like a generic database error to the application, and without the right view it is hard to tell why. The view that explains it is pg_stat_database_conflicts.

Why standby query cancellation happens

The standby is continuously applying WAL records from the primary. Some WAL records cannot be applied without changing data that an in-progress query on the standby is reading. The cleanest example is a row that the primary has VACUUM'd: the WAL record removes the row, but a query on the standby that started before the WAL record arrived is still reading it. Postgres can either delay the WAL apply (causing replication lag) or cancel the query (causing application errors). The default is to cancel the query after a brief delay.

This is not a bug. It is the unavoidable consequence of running read queries against a database that is continuously being modified by replicated writes. The trade-off between query stability and replication freshness is real, and the operator has to pick a point on the curve.

What pg_stat_database_conflicts exposes

The view has one row per database with five conflict-type columns:

  • confl_tablespace: WAL replay requires a tablespace that a query is using. Rare in most deployments.
  • confl_lock: WAL replay requires a lock that a query holds. Usually means a DDL replication conflict.
  • confl_snapshot: WAL replay removes rows still visible to a query's snapshot. The most common conflict in practice and the one VACUUM-driven cancellations show up as.
  • confl_bufferpin: WAL replay needs to evict a buffer a query has pinned. Usually means very short-running but very frequent queries.
  • confl_deadlock: Replication and a query are in a deadlock cycle. Rare.

The headline number is confl_snapshot. If you have standby query cancellations and confl_snapshot is non-zero and growing, the cause is VACUUM on the primary removing rows that long-running queries on the standby still need to see.

The configuration knobs

Three settings interact to control the trade-off:

max_standby_streaming_delay sets how long the standby will delay WAL replay before cancelling queries. Default is 30 seconds. Increasing this gives queries more time to finish but increases replication lag. Setting it to -1 means unlimited delay, which can cause indefinite lag but never cancels queries.

hot_standby_feedback = on tells the standby to send the oldest active query's xmin back to the primary, which makes the primary avoid VACUUM'ing rows that standby queries still need. This eliminates most snapshot conflicts but causes bloat accumulation on the primary if standby queries run long enough to retain xmin for a long time.

vacuum_defer_cleanup_age (deprecated in Postgres 16+) is the older alternative to hot_standby_feedback for delaying VACUUM cleanup, with the same bloat trade-off.

The diagnostic pattern

The query to keep handy:

SELECT datname, confl_snapshot, confl_bufferpin, confl_lock,
       confl_tablespace, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname IS NOT NULL
ORDER BY confl_snapshot DESC;

Run on the standby. The deltas matter, not the cumulative values; pg_stat_database_conflicts is cumulative since last reset.

If confl_snapshot is the dominant counter, the action is either to increase max_standby_streaming_delay (accept more lag) or to enable hot_standby_feedback (accept bloat on primary). If confl_bufferpin dominates, the issue is short-frequent queries and the action is often to consolidate them. If confl_lock dominates, the issue is DDL replication; consider scheduling DDL during off-peak.

What the view does not tell you

The view does not record which queries were cancelled, only the count of cancellations by type. To get the queries, you need pg_stat_activity sampling on the standby with state-change logging, or application-side logging of cancellation errors with query text.

The view also does not distinguish between cancellations that affected real queries vs short-running queries that completed before the deadline. The count is cancellations, not query-impact.

And the view does not include current replication lag in seconds or bytes. Use pg_stat_replication on the primary or pg_last_wal_receive_lsn / pg_last_wal_replay_lsn comparison on the standby for lag.

Our use across the four products

DocuMint, CronPing, FlagBit, and WebhookVault are SQLite-based and have no streaming replication, so pg_stat_database_conflicts has no equivalent. Our Postgres migration plan includes a standby for read-traffic scaling, and the migration plan explicitly includes monitoring this view as part of the standby setup. We expect to start with hot_standby_feedback = on and a moderate max_standby_streaming_delay, then tune from observed conflict counts.

The deeper observation is that running read replicas is a useful technique but it introduces operational surprises that single-instance Postgres does not have. The pg_stat_database_conflicts view is one of the standby-specific tools that surfaces an entire class of failure mode that does not exist on the primary. The pattern recurs across replication topologies: each new operational layer adds its own observability requirements, and the absence of the right view is often more informative than its content.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) keep the lights on.

Read more