Postgres pg_stat_archiver: Watching WAL Archiving for Backup Confidence

WAL archiving is the foundation of point-in-time recovery in Postgres, and the silent-failure mode of archive_command misconfiguration is one of the most expensive operational mistakes a team can make. pg_stat_archiver exposes the counters that tell you whether your backups can actually be r...

WAL archiving is the foundation of point-in-time recovery in Postgres. If your archive_command is silently failing, your base backups become unrestorable to any point after the failure started. The pg_stat_archiver view is the single most important view in Postgres for catching this failure mode before it matters, and it is one of the most underwatched views in production.

What pg_stat_archiver exposes

The view has one row, with cumulative counters since the last reset. The columns that matter are archived_count (number of WAL files successfully archived), last_archived_wal and last_archived_time (which WAL file was last successfully archived and when), failed_count (number of failed archive attempts), last_failed_wal and last_failed_time (which WAL file was last attempted and failed, and when), and stats_reset.

The view is intentionally small. It does not show why an archive failed, only that it did. It does not show the contents of the WAL files or what database changes are in flight. It shows only the operational state of the archive pipeline: how many files have been archived, how many have failed, and when the most recent of each occurred.

The silent-failure mode

The default archive_command behavior is to retry indefinitely on failure. Postgres will not fail user transactions because archiving is broken; it will just keep trying to archive the same WAL file over and over. The pg_wal directory will grow unbounded because Postgres cannot recycle WAL segments that have not been archived. Eventually pg_wal will fill the disk and the database will refuse new writes. That is the loud failure mode that gets attention.

The quiet failure mode is much worse. Suppose your archive_command points at an S3 bucket and the bucket policy gets changed by an admin who does not realize the database depends on it. Every archive attempt now fails. The failed_count starts climbing. Your nightly base backup succeeds because base backup does not depend on archiving. Your monitoring shows green because nobody is alerting on failed_count. Weeks pass. The pg_wal directory grows but not catastrophically because your write volume is modest. One day you need to do a point-in-time recovery for a botched migration, and you discover that you have no WAL files beyond the moment archiving broke, which means you can restore to your last base backup but no further. The window of unrecoverable data is the gap between when archiving broke and now.

The fix is monitoring the right metric. failed_count incrementing is the leading indicator. last_failed_time more recent than last_archived_time is the lagging indicator that archiving is currently broken. Both are visible in pg_stat_archiver and both should produce alerts in any production deployment with point-in-time recovery requirements.

The everyday diagnostic queries

Three queries should be in any operator's toolbox.

First, current health snapshot:

SELECT archived_count, failed_count,
       last_archived_wal, last_archived_time,
       last_failed_wal, last_failed_time,
       now() - last_archived_time AS time_since_archive
FROM pg_stat_archiver;

Second, is archiving currently broken (this is the alert query):

SELECT
  CASE
    WHEN last_failed_time IS NULL THEN 'never_failed'
    WHEN last_failed_time > last_archived_time THEN 'currently_broken'
    ELSE 'recovered'
  END AS status,
  last_failed_wal, last_failed_time
FROM pg_stat_archiver;

Third, is archiving keeping up with WAL generation (this catches the slow-archiver case where archiving is succeeding but lagging behind):

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / (16 * 1024 * 1024) AS current_wal_segment_number,
       last_archived_wal,
       now() - last_archived_time AS time_since_last_archive
FROM pg_stat_archiver;

The configuration knobs that matter

archive_mode must be on (not always or off) for the standard streaming-replication setup. archive_timeout forces an unfilled WAL segment to be archived after the specified interval, which bounds the recovery RPO at the cost of more, smaller WAL files in the archive. archive_command is the shell command that performs the archive; the convention is that it returns 0 on success and non-zero on failure, and that it is idempotent because Postgres may retry the same file multiple times.

The archive_command itself is the most common failure point. The command must be quoted carefully (Postgres substitutes %p for the source path and %f for the filename), must handle the case where the target already exists (idempotent), must verify the upload before returning success (a common bug is returning success when the upload silently fails partway through), and must propagate the correct exit code.

The recommended pattern for cloud storage is to use a purpose-built tool like pgbackrest or wal-g rather than a hand-rolled shell command. These tools handle the retry logic, the verification, the idempotency, and the failure-mode reporting in ways that are difficult to get right manually.

What pg_stat_archiver does not show

The view does not show why archiving failed. To diagnose, you need to look at the Postgres log file for archive_command output, the system log for the shell command, and the cloud provider logs for the storage backend. The log messages include the exit code of the archive_command and any stderr output the command produced.

The view does not show how full pg_wal is. That is in pg_stat_replication (for streaming replicas) and in os-level disk-free monitoring. Both should be monitored independently because they catch different failure modes.

The view does not show whether the archived files are actually intact. The only way to verify that is to perform a restore from the archive periodically. This is the restore-drill discipline that we have written about elsewhere; the short version is that backups you have not tested restoring are not backups, they are hope.

Our position across the four products

All four of our products (DocuMint, CronPing, FlagBit, WebhookVault) currently run on SQLite with periodic file-level snapshots rather than Postgres with WAL archiving. The SQLite analog is the .backup command and the WAL mode checkpoint behavior; both are operationally simpler than Postgres WAL archiving but do not support arbitrary point-in-time recovery. The planned Postgres migration would bring pg_stat_archiver monitoring into the operational dashboard as a first-class metric.

The deeper observation

The most important Postgres operational views are the ones that catch silent failures. pg_stat_archiver is the canonical example because the failure mode (unrestorable backups) is invisible until you need to restore and discover you cannot. The discipline of monitoring failed_count and last_failed_time vs last_archived_time is one of those cases where a simple alert can prevent an unrecoverable incident, and where the absence of the alert is one of the most common operational gaps in production Postgres deployments.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) keep the lights on.

Read more