Postgres pg_stat_archiver in Depth: Catching WAL Archiving Failures Before They Compound
WAL archiving is the foundation of point-in-time recovery, and the most common failure mode is silent: archive_command fails, pg_wal fills up, and nobody notices until the cluster refuses writes. pg_stat_archiver catches the failure before it compounds.
Most Postgres clusters that depend on point-in-time recovery have an archive_command configured in postgresql.conf that ships completed WAL segments to long-term storage, typically S3 or another object store via a sidecar tool like pgBackRest or wal-g or a hand-rolled shell command. The mechanism is conceptually simple: Postgres calls the configured command for each WAL segment that fills, and a return code of zero signals success. The success path is invisible because nothing has to happen for operators to notice it; segments arrive at the destination, and recovery scenarios can use them.
The failure path is the operationally dangerous one. When archive_command returns a non-zero exit code, Postgres retains the WAL segment in pg_wal and tries again. If the failure persists, segments accumulate. The pg_wal directory grows until it fills the disk. The cluster eventually refuses to accept writes when the disk fills, which is the failure that finally produces a visible incident. The chronology is predictable: the failure starts, the failure count quietly increments, segments accumulate over hours or days, disk usage climbs, and the catastrophic failure happens long after the initial cause. The pattern is the classic silent-failure-with-delayed-catastrophe shape that operators encounter repeatedly across database systems.
What pg_stat_archiver exposes
The pg_stat_archiver view is the system view that catches the failure before it compounds. The view has a single row with six columns: archived_count is the cumulative number of WAL segments successfully archived since the last reset, last_archived_wal is the name of the most recently archived segment, last_archived_time is the timestamp of that archival, failed_count is the cumulative number of failed archive_command invocations, last_failed_wal is the segment name of the most recent failure, and last_failed_time is the timestamp of that failure.
The diagnostic questions the view answers include: is archiving working, when did it last succeed, when did it last fail, and what is the success rate over a recent window. The view does not tell you why a failure occurred or whether the archived files are intact at the destination; those questions require log inspection and restore drills respectively. The view is one of the small set of system views whose value is primarily about catching silent failures rather than about quantitative performance analysis. The right cadence for checking the view is more frequent than the operational rhythms might suggest because the cost of missing a failure window is high.
The everyday diagnostic queries
The first query is the current health snapshot: select last_archived_time, last_failed_time, failed_count, and the seconds since last_archived_time. The output is the smallest amount of information that distinguishes a healthy archiver from a failing archiver. The seconds-since-last-archived value is the leading indicator: if the value exceeds the typical inter-segment interval by more than a small multiple, archiving is either falling behind or has stopped entirely, and either condition warrants investigation. The interval depends on the write rate of the cluster; a busy cluster fills WAL segments every few seconds while a quiet cluster might go many minutes between segments.
The second query is the alerting query: select 1 from pg_stat_archiver where last_failed_time > last_archived_time. The query returns a row when the last failure is more recent than the last success, which is the condition that distinguishes intermittent failures from sustained failures. The presence of a row should trigger an alert because sustained failure is the condition that produces the disk-fill incident. The query is cheap and can run frequently; the right cadence is at least once per minute for clusters where point-in-time recovery is a load-bearing capability.
The third query is the keeping-up calculation: compare the most recently archived WAL segment name to the current WAL segment from pg_current_wal_lsn. The Postgres function pg_walfile_name(pg_current_wal_lsn()) returns the current segment, and the comparison to last_archived_wal indicates how many segments behind archiving is. The pattern is the WAL-side equivalent of replication slot lag monitoring, with the same interpretation: small gaps are normal during high write activity, sustained large gaps indicate that the archive_command cannot keep up with the write rate.
The configuration knobs that matter
The archive_mode parameter is the master switch. The right value for production clusters that need point-in-time recovery is on; the alternative value of always also enables archiving during recovery, which is useful for cascading replica setups. The parameter requires a restart to change, which is the kind of friction that produces operational drift if not caught during initial deployment.
The archive_command parameter is the shell command Postgres invokes for each completed WAL segment. The right value is usually a wrapper command from pgBackRest, wal-g, or a similar tool rather than a hand-rolled shell pipeline, because the wrapper commands handle the corner cases around retry, idempotency, and destination authentication that hand-rolled commands tend to get wrong. The configuration accepts %p for the source path and %f for the file name, and the exit code zero signals success. A non-zero exit code signals failure, and Postgres treats any non-zero code identically; the wrapper command is responsible for distinguishing transient from permanent failures and retrying appropriately.
The archive_timeout parameter forces a WAL segment switch after the specified interval, which bounds the maximum age of unarchived data. The right value for clusters where recovery point objective matters is a small multiple of the recovery point objective expressed in seconds; a one-minute archive_timeout produces at most one minute of unrecoverable data even if writes are slow enough that segments would otherwise accumulate for hours before filling. The trade-off is that aggressive archive_timeout values produce more small WAL files, which slightly increases storage cost and inflates the segment count in restore scenarios.
What the view does not show
The view does not show why a failure occurred. The exit code and any stderr output from archive_command appear in the Postgres log rather than in pg_stat_archiver, and diagnosing the cause of a failure requires reading the log around the timestamp of last_failed_time. The combination of pg_stat_archiver for detection and log inspection for diagnosis is the right operational pattern; using the view alone produces alerts without enough context to act on them.
The view does not show pg_wal directory fullness. The relationship between archiving failures and disk usage is the load-bearing reason archiving matters operationally, but pg_stat_archiver does not expose disk metrics. Monitoring pg_wal size requires separate disk-usage monitoring at the operating system level. The right pattern is to alert on both pg_stat_archiver showing recent failures and pg_wal directory size exceeding a threshold, because either signal indicates a problem and the combination of both is the high-severity incident.
The view does not show whether the archived files are actually intact at the destination. The archive_command returns zero on success, but success at the command level does not guarantee the file is readable, complete, or stored durably. The verification requires periodic restore drills that fetch a sample of archived segments and use them in test recovery scenarios. The drill is the only thing that closes the confidence-fitness gap between archive_command returning zero and the archived files being usable for actual recovery.
The application across our four products
Our four products run SQLite, which does not have WAL archiving in the Postgres sense. SQLite's WAL mode produces a write-ahead log that lives alongside the main database file and is periodically checkpointed back into the main file. The closest analog to Postgres WAL archiving is the combination of nightly file-level backups via .backup or VACUUM INTO plus continuous replication via Litestream to S3. The Litestream sidecar emits its own observability metrics including replication lag and last successful sync, which we monitor via a small custom dashboard.
The Postgres migration plan for the four products includes archive_mode=on with a wrapper command from pgBackRest, archive_timeout set to 60 seconds to bound RPO, pg_stat_archiver monitoring with alerts on failure_count incrementing and on last_failed_time being more recent than last_archived_time, separate pg_wal directory size monitoring with alerts at 50 and 80 percent of disk, and quarterly restore drills using sampled archived segments. The investment is small and the alternative is a silent failure mode that compounds over days into a disk-fill incident.
The deeper observation is that the most important operational views in Postgres are the ones that catch silent failures rather than the ones that produce performance reports. pg_stat_archiver, pg_replication_slots, pg_stat_activity for idle-in-transaction sessions, and pg_stat_database_conflicts for standby cancellations are all in this category: they are small views whose primary value is converting a class of silent failure into a noisy alert. The alternative to monitoring them is discovering the failures only when they compound into incidents that are much more expensive to recover from than the original failure would have been. The discipline of monitoring the silent-failure views is one of the operational practices that distinguishes mature Postgres deployments from new ones.
Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.