Why Your Database Backups Are Probably Broken: Testing Restore, Not Just Dump

The exit code is 0. The dump file exists and it has the right size. Your backup ran. The question you haven't answered is whether you could restore from it.

Three Failure Modes That Only Appear on Restore

Encoding mismatch between dump and target. pg_dump records the source database's encoding in the dump. If the target database uses a different encoding (for example, a fresh database created without specifying one, on a cluster whose default is not UTF8), the restore, whether through psql or pg_restore, can fail with character conversion errors on rows containing characters the target encoding cannot represent. Nothing on the dump side catches this. The dump is correct. The target doesn't accept it.

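The fix is straightforward: before restoring, create the target database with the same encoding and locale as the source, for example with createdb --encoding=UTF8 --locale=en_US.UTF-8 restore_target. If the cluster's default template database uses a different encoding or locale, add --template=template0.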
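
Rather than guessing the source settings, you can read them from the source database and build the createdb call from the result. The following is a minimal sketch, not a definitive tool: createdb_cmd_from is a hypothetical helper name, source_db is a placeholder, and the function only prints the command so you can inspect it before running it.

```shell
#!/usr/bin/env bash
# Build a createdb command from one line of `psql -At` output of the form
#   encoding|lc_collate|lc_ctype
# createdb_cmd_from is a hypothetical helper, not part of PostgreSQL.
createdb_cmd_from() {
  local settings=$1 target=$2
  local enc rest collate ctype
  enc=${settings%%|*}      # text before the first '|'
  rest=${settings#*|}      # text after the first '|'
  collate=${rest%%|*}
  ctype=${rest#*|}
  printf 'createdb --template=template0 --encoding=%s --lc-collate=%s --lc-ctype=%s %s\n' \
    "$enc" "$collate" "$ctype" "$target"
}

# Usage against a live source database (source_db is a placeholder):
#   settings=$(psql -At -d source_db -c "SELECT pg_encoding_to_char(encoding), datcollate, datctype FROM pg_database WHERE datname = current_database()")
#   createdb_cmd_from "$settings" restore_target
```

Printing instead of executing is a deliberate choice here: a restore test should make the target's settings visible, not hide them.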

Ownership and permissions referencing roles that don't exist on the target. pg_dump preserves GRANT statements and object ownership. If your production database has objects owned by a role named app_user, the restore will raise errors such as role "app_user" does not exist unless that role exists on the target. This is especially common when restoring to a fresh RDS instance, a different Postgres cluster, or a developer's local machine.

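The --no-owner and --no-privileges flags (accepted by both pg_dump and pg_restore) sidestep this, but they change what gets restored: objects end up owned by the user running the restore, and no GRANTs are applied. That may be wrong if your application depends on row-level security policies tied to role membership.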
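
Before choosing between creating the roles and stripping ownership, it helps to know which roles the dump actually names. The sketch below reads the table of contents printed by pg_restore --list, where comment lines start with a semicolon and the last field of each entry is the owner. The helper name dump_owners is hypothetical, it assumes role names without spaces, and it only covers owners; roles that appear solely as grantees in GRANT statements are not listed.

```shell
#!/usr/bin/env bash
# Print the distinct owner roles named in a dump's table of contents.
# Reads `pg_restore --list` output on stdin. Lines starting with ';' are
# comments; the last whitespace-separated field of each entry is the owner.
# dump_owners is a hypothetical helper name.
dump_owners() {
  awk '!/^;/ && NF > 0 { print $NF }' | LC_ALL=C sort -u
}

# Usage (backup.dump must be an archive-format dump, not plain SQL):
#   pg_restore --list backup.dump | dump_owners
# Compare the result with the roles present on the target:
#   psql -At -d postgres -c "SELECT rolname FROM pg_roles"
```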

Sequences not resuming at the correct position after logical restore. A complete pg_dump records each sequence's current value, and the restore reapplies it with setval(). That step can be missing or can fail, for example when the schema comes from a schema-only dump and the data is loaded separately, or when a restore hits errors partway through. The sequences behind serial/bigserial/identity columns are then left at their starting position. Your application tries to insert with id=1 and hits a duplicate key violation.

After a logical restore, verify sequences:

SELECT sequencename, last_value
FROM pg_sequences
ORDER BY sequencename;

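Compare against the maximum values in the corresponding tables. A sequence that is ahead of its table is harmless. If last_value is lower than the table's maximum, or NULL while the table has rows, use setval() before handing the database to your application.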
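
The setval() call itself is short enough to generate per table. The sketch below prints a statement that moves a column's sequence to the table's current maximum, or leaves it at 1 (not yet used) when the table is empty. The helper name setval_sql is hypothetical, and it assumes simple unquoted identifiers and a column backed by a serial or identity sequence that pg_get_serial_sequence can resolve.

```shell
#!/usr/bin/env bash
# Print a SQL statement that realigns the sequence behind table.column.
# With rows present: setval(seq, MAX(col), true)  -> next value is MAX+1.
# With no rows:      setval(seq, 1, false)        -> next value is 1.
# setval_sql is a hypothetical helper; identifiers are assumed to need no quoting.
setval_sql() {
  local table=$1 column=$2
  printf "SELECT setval(pg_get_serial_sequence('%s', '%s'), COALESCE((SELECT MAX(%s) FROM %s), 1), (SELECT MAX(%s) FROM %s) IS NOT NULL);\n" \
    "$table" "$column" "$column" "$table" "$column" "$table"
}

# Usage (restore_target is a placeholder database name):
#   setval_sql users id | psql -d restore_target
```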

Exit Code 0 Doesn't Mean Restorable

pg_dump exits 0 on success. "Success" means "I finished writing the dump." It doesn't mean "the target database will accept this dump." You can have a perfectly valid pg_dump output file that will fail to restore, and pg_dump has no way to know this—it doesn't know where you're restoring to.

The Minimum Viable Restore Test

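# 1. Create a scratch database (compute the name once so every step uses the same one)
RESTORE_DB="restore_test_$(date +%Y%m%d)"
createdb --template=template0 --encoding=UTF8 --locale=en_US.UTF-8 "$RESTORE_DB"

# 2. Restore the dump (an archive-format dump such as pg_dump -Fc, not plain SQL)
#    and stop at the first error instead of continuing past it
pg_restore --exit-on-error -d "$RESTORE_DB" backup.dump

# 3. Run a diagnostic query that exercises your schema
psql -d "$RESTORE_DB" -c "SELECT COUNT(*) FROM users"

# 4. Clean up
dropdb "$RESTORE_DB"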

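The diagnostic query should be meaningful: not just a connection test, but a query that would fail if the restore were incomplete or a table were missing. The COUNT(*) above is the floor. It says nothing about sequences, so pair it with the sequence check from the previous section.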
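
One step up from a single COUNT(*) is checking that every table you expect is present. The sketch below compares a list of expected table names against the names found in the restored database. The helper name missing_tables and the table names in the usage line are hypothetical; substitute the tables your application cannot run without.

```shell
#!/usr/bin/env bash
# Read actual table names on stdin (one per line) and print every expected
# name, given as arguments, that is absent. Returns 1 if any are missing.
# missing_tables is a hypothetical helper name.
missing_tables() {
  local actual t status=0
  actual=$(cat)
  for t in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet
    if ! printf '%s\n' "$actual" | grep -Fxq -- "$t"; then
      echo "$t"
      status=1
    fi
  done
  return "$status"
}

# Usage against the scratch database from the steps above:
#   psql -At -d "$RESTORE_DB" -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" \
#     | missing_tables users orders invoices
```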

WAL Archiving vs Logical Backup

pg_dump is a logical backup. Point-in-time recovery requires WAL archiving. These are not substitutes—they're complements with different failure modes.

A logical backup gives you a snapshot at a moment in time. Restoring from it means replaying all DDL and data writes from the dump, which is slow for large databases and doesn't give you arbitrary point-in-time recovery.

WAL archiving gives you continuous recovery to any point, but requires a base backup plus all intervening WAL segments. If any segment is missing, recovery stops at the last valid point before the gap.

A common production configuration: weekly logical backup (for portability and cross-cluster restores) plus continuous WAL archiving (for PITR). Testing both independently matters. A WAL restore test is more involved, but it's the only way to know your WAL pipeline is intact.
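
A cheap check that complements a full WAL restore test, without replacing it, is scanning the archive for missing segments. The sketch below assumes the default 16 MB segment size, where a segment name is 24 hex digits (8 for the timeline, 8 for the log number, 8 for the segment) and the last group runs from 00000000 to 000000FF before the middle group increments. The helper names are hypothetical, the archive path in the usage line is a placeholder, and compressed or .partial files are ignored rather than interpreted.

```shell
#!/usr/bin/env bash
# Print the WAL segment name expected to follow the given one.
# Assumes the default 16 MB segment size (256 segments per log number).
wal_next() {
  local tli log seg
  tli=$(printf '%s' "$1" | cut -c1-8)
  log=$(( 0x$(printf '%s' "$1" | cut -c9-16) ))
  seg=$(( 0x$(printf '%s' "$1" | cut -c17-24) + 1 ))
  if [ "$seg" -gt 255 ]; then
    seg=0
    log=$((log + 1))
  fi
  printf '%s%08X%08X\n' "$tli" "$log" "$seg"
}

# Read file names on stdin and report gaps between consecutive segments on
# the same timeline. Names that are not exactly 24 hex digits are skipped.
# Exits non-zero if any gap was found.
wal_find_gaps() {
  grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort | (
    prev="" status=0
    while IFS= read -r name; do
      # ${x%????????????????} strips the last 16 characters, leaving the timeline
      if [ -n "$prev" ] && [ "${prev%????????????????}" = "${name%????????????????}" ]; then
        expected=$(wal_next "$prev")
        if [ "$name" != "$expected" ]; then
          echo "gap: expected $expected after $prev, found $name"
          status=1
        fi
      fi
      prev=$name
    done
    exit "$status"
  )
}

# Usage (the archive directory is a placeholder):
#   ls /path/to/wal_archive | wal_find_gaps
```

A clean result only shows that the names are contiguous. It does not prove the segments are readable or that recovery will replay them, which is why the full restore test still matters.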

Automating the Smoke Test

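#!/bin/bash
# Restore the latest dump into a scratch database and sanity-check the result.
set -euo pipefail

RESTORE_DB="restore_smoke_$(date +%Y%m%d)"
# Drop the scratch database on exit, whether the test passed or failed.
trap 'dropdb --if-exists "$RESTORE_DB" 2>/dev/null || true' EXIT

# template0 avoids encoding conflicts with the cluster's default template.
createdb --template=template0 --encoding=UTF8 --locale=en_US.UTF-8 "$RESTORE_DB"
# Stop at the first restore error instead of continuing past it.
pg_restore --exit-on-error -d "$RESTORE_DB" /backups/latest.dump

ROW_COUNT=$(psql -At -d "$RESTORE_DB" -c "SELECT COUNT(*) FROM users")
if [ "$ROW_COUNT" -lt 1 ]; then
  echo "RESTORE SMOKE TEST FAILED: users table empty" >&2
  exit 1
fi

echo "Restore smoke test passed: $ROW_COUNT users"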

The trap ensures cleanup even on failure. Wire the script's exit status into your alerting so that a failure pages your on-call rotation. The goal is that "the backup works" becomes a fact you verify continuously, not an assumption you hold until you need it.
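
How the failure reaches a human depends on your alerting stack, so the sketch below only shows the shape: run a command, and on a non-zero exit hand a message to an alert hook. Everything here is hypothetical, including the names run_with_alert, alert_stderr, and the ALERT_CMD variable and the script path in the usage line. The default hook just writes to stderr; replace it with whatever your paging system provides.

```shell
#!/usr/bin/env bash
# Default alert hook: write the message to stderr. Hypothetical placeholder
# for a real paging integration.
alert_stderr() {
  echo "ALERT: $1" >&2
}

# Run the given command. If it exits non-zero, call the alert hook named in
# ALERT_CMD (default: alert_stderr) with a message, then return the same status.
run_with_alert() {
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    ${ALERT_CMD:-alert_stderr} "restore smoke test failed (exit $status): $*"
  fi
  return "$status"
}

# Usage (the script path is a placeholder):
#   run_with_alert /usr/local/bin/restore_smoke_test.sh
```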
