Why Your Database Backup Is Not a Backup Until You Restore It

The rule is simple: an untested backup is not a backup. It is a hope stored on disk.

The failure mode is predictable. You set up pg_basebackup, you verify it completes, you check the success email, and you believe you have a backup. You do not have a backup. You have a file that will probably restore correctly, with a failure rate you have not measured, discovered the moment you need it most.

Three failure modes that look fine until restore day

1. Corrupted WAL segments

pg_basebackup completes successfully. WAL archiving shows green. The archive directory fills with files. But a silent disk error — a single bad sector, a storage controller that silently reordered writes, a NFS mount that dropped a packet and retried at the wrong offset — corrupted one WAL segment in the middle of the archive.

The corruption is not detected at write time. The file exists. It has the right size. The checksum was never verified. You discover this during restore, when pg_restore reaches that segment and reports:

invalid record length at 0/3A000028: wanted 24, got 0
LOG: startup process (PID 391) exited with exit code 1

This is 3 AM. The primary database is down. You need to restore from backup. The backup is invalid. You are now discovering this.

The prevention is straightforward: verify WAL segments after archiving. A simple script that reads each archived segment and checks the record structure catches corruption before you need the backup. pg_waldump will fail loudly on a corrupt segment. Run it on the most recent archived segment daily.

2. Missing tablespace references

Your backup captured $PGDATA. The backup script was written in 2022, when all your data lived there. In 2023, someone added a tablespace on /mnt/fast-ssd for the tables that needed low-latency reads. The tablespace mount was not in the backup script — it happened after the script was written, and nobody updated it.

The restore succeeds. Postgres starts. Half your tables are missing. The tablespace exists in pg_tablespace, but the files it points to were not backed up. The data is not corrupted — it was simply never captured.

This one is caught by: after restore, run a count query on the five largest tables. If any return zero rows or a relation-does-not-exist error, your backup is incomplete. A restore test that only verifies "Postgres started" is not a restore test.

3. Version mismatch

pg_dump from Postgres 16 generates SQL with features and syntax available in Postgres 16. If your disaster recovery target is a Postgres 14 instance — your DR environment was set up in 2022 and nobody upgraded it — the import will fail on the first statement it cannot parse.

The reverse is usually safe: a dump from Postgres 14 imported into Postgres 16 generally works, because Postgres maintains backward compatibility for SQL syntax. But not always. Features removed or deprecated in the newer version can cause failures in unexpected places.

Version mismatch is discovered at restore time, when your DR database is a different version than production. It is prevented by keeping DR environment Postgres versions synchronized with production, and by testing restores into the actual DR environment rather than a developer laptop running whatever version was convenient to install.

The discipline: weekly automated restore to a scratch instance

The solution is a restore test that runs on a schedule and fails loudly if anything is wrong:

docker run --rm -e POSTGRES_PASSWORD=test postgres:16 &
# wait for container
pg_restore -h 127.0.0.1 -U postgres -d test latest.dump
psql -h 127.0.0.1 -U postgres -d test -c "SELECT COUNT(*) FROM orders"
psql -h 127.0.0.1 -U postgres -d test -c "SELECT COUNT(*) FROM users"
psql -h 127.0.0.1 -U postgres -d test -c "SELECT COUNT(*) FROM products"
# if any count is suspiciously low, alert
docker stop ...

Weekly. Automated. Failure sends an alert. The restore is against the same Postgres version as production. The smoke queries cover the five largest tables. The container is destroyed after the test — you are not accumulating stale test environments.

For SQLite on small stacks (like our Builds directory): Litestream handles continuous WAL replication to S3. The restore test is simpler — download the current Litestream snapshot, run litestream restore, open the resulting database file with SQLite, query a row count. If that passes, the backup is good. Run it weekly. Alert if it fails.

What backups do not protect against

Application-level logical corruption discovered days later. If your application wrote bad data — a bug that multiplied all prices by 100, a migration that set the wrong default, a cascade delete that removed records it should not have — and that corruption was present for three days before anyone noticed, your most recent backup contains the bad data. Your week-old backup might not. This is the problem point-in-time recovery solves: you can restore to the moment before the bad write. Backups without PITR are binary — latest or nothing.

Ransomware that encrypts backup storage too. If your backups live on an attached network share, and that share is accessible from the compromised host, ransomware will encrypt the backups along with everything else. Offline copies — S3 with object lock, tape, a drive physically disconnected from the network — are the defense. "S3 bucket" is not automatically safe; a compromised AWS credential with write permissions to the bucket is sufficient to overwrite or delete backups.

Gradual data rot in columns nobody queries. A column that slowly accumulates NULL values when it should not, a timestamp that has been drifting wrong for months, a foreign key reference that has become invalid without triggering a constraint — these are not backup problems. They are data integrity problems that backups faithfully preserve. Regular data audits catch them; backups do not.

The test cadence that actually works

The goal is to know your backup is good before you need it, not while you need it. Weekly is a reasonable cadence for most applications — frequent enough that a new failure mode is discovered within a week rather than discovered during an incident. For higher-stakes data, daily restore tests are achievable and worth the CI/CD complexity.

The restore test should take less than 30 minutes to run. If your restore takes hours, your DR window is hours, and you should know that before an incident forces you to discover it.

Test the restore. Then test the smoke queries. Then verify the Postgres version matches. Then destroy the test environment. Automate it, alert on failure, and ignore it the rest of the time — that is the discipline.

Written for Anethoth. Builders: check out builds.anethoth.com.