Postgres pg_dump and Continuous Backups: A Strategy That Survives a Real Disaster
pg_dump is what most teams reach for and what most teams should not rely on alone. The honest backup strategy combines logical dumps for portability with physical base backups plus WAL archiving for point-in-time recovery.
The backup question every Postgres-using team eventually faces is what happens between the moment someone runs a destructive query and the moment they realize the mistake. If the answer is "we restore last night's pg_dump," the team has chosen one specific recovery story without considering the alternatives. The honest answer is more interesting and more demanding to set up, and the cost of getting it right is small relative to the cost of getting it wrong once.
We run this exact strategy across DocuMint, CronPing, FlagBit, and WebhookVault, with adaptations for the SQLite-backed services and full PostgreSQL where we use it. The mechanics differ across engines but the underlying decision framework is the same.
What pg_dump is and is not
pg_dump produces a logical backup: a stream of SQL statements (or a custom binary format that pg_restore can replay) that can recreate the schema and data in any compatible Postgres instance. Its strengths are portability across major versions, selective restore of individual tables or schemas, and human-inspectability of the output. These are real strengths and they cover the most common backup use case of "we need a copy of the database from a known good point."
What pg_dump is not is a real-time backup. The dump captures a transactionally consistent snapshot at the moment the dump starts, and everything written after that moment is not in the backup. If you run pg_dump nightly at 02:00 and a destructive query runs at 14:30, the only recovery is to restore the 02:00 snapshot and accept the loss of 12.5 hours of work. For most production systems this loss is the difference between an inconvenience and a business-ending incident.
pg_dump also takes time on large databases. A multi-hundred-gigabyte dump can take hours, holds an MVCC snapshot that blocks autovacuum from cleaning up dead tuples during the dump, and produces a file roughly the size of the data that has to be transferred and stored.
What base backup plus WAL archiving adds
The physical backup story is pg_basebackup plus continuous WAL archiving. pg_basebackup copies the actual data files of the cluster, producing a snapshot that can be restored as a running Postgres instance without needing to replay any SQL. WAL archiving copies every transaction log segment to durable storage as it is filled, capturing every modification made since the base backup.
The combination enables point-in-time recovery: restore the base backup, then replay WAL segments up to a target time (down to the second), and recover the database state as of that exact moment. The destructive query at 14:30:42 can be recovered from by restoring to 14:30:41.
The base backup is usually taken weekly, and WAL archiving is continuous. Recovery time depends on how much WAL has to be replayed, so the base backup frequency is a trade-off between storage cost and recovery time. Weekly base backups with continuous WAL is a reasonable default for most teams.
The retention policy decision
Retention is a separate decision from backup mechanism. The questions are how far back you need to recover and at what granularity. A reasonable default policy:
- Last 7 days: point-in-time recovery available to the second (full WAL retained).
- Last 30 days: daily snapshots available (one base backup per day).
- Last 12 months: weekly snapshots available (one base backup per week).
- Last 7 years: monthly snapshots in cold storage (compliance and rare audit cases).
The pricing math at small scale is favorable: a 100GB database with 5GB/day of WAL produces about 50GB of weekly base backups and 35GB of weekly WAL, which is roughly $2/month in S3 standard storage and a few dollars more in the lifecycle tiers. The recovery insurance is much cheaper than most teams assume.
The verification discipline
An untested backup is a hypothesis, not a backup. The single most underrated discipline in this space is the quarterly restore drill: pick a backup at random, restore it to a test environment, run integration tests against the restored database, and time the entire process. The first time you do this you will find at least one thing wrong, and probably two or three.
The common failures are: WAL archiving silently broken for weeks because the cron job was deleted, base backup script holding an exclusive lock that nobody noticed, retention policy deleting backups that were referenced by point-in-time recovery, restore process requiring Postgres extensions that are not installed in the recovery environment, encryption keys for backup files stored only on the original server. Each of these is the kind of bug that only manifests when you actually need the backup, which is the worst possible time to discover it.
The off-site requirement
Backups stored on the same host as the database protect against accidental delete and software bug, not against host loss, region failure, or ransomware. The modern interpretation of the 3-2-1 rule (3 copies, 2 media, 1 off-site) is 3 copies across at least 2 independent providers in at least 2 geographic regions. For a small SaaS, this typically means: live database on primary host, snapshot replication to managed object storage (S3, R2, GCS) in a different region, and a secondary copy to a different provider entirely.
The off-site copy needs its own access controls separate from the primary host, so that compromise of the primary does not automatically compromise the backups. The canonical case is encrypted backups uploaded to S3 buckets with Object Lock enabled and access keys distinct from any keys present on production hosts.
What we do across the products
For the SQLite-backed services (DocuMint, CronPing, FlagBit, WebhookVault) we run an hourly snapshot script: SQLite's .backup command produces a consistent copy without blocking writes, the snapshot is age-encrypted with a key stored only in the backup destination, and the encrypted file is uploaded to two object storage providers in different regions. WAL-mode SQLite plus the backup command gives us point-in-time recovery to the hour. Quarterly restore drills verify the pipeline end-to-end.
For Ghost (anethoth.com) we rely on its own database export plus content archive, with the same encrypt-and-upload pipeline. For Listmonk and Plausible (on PostgreSQL) we use pg_basebackup weekly plus continuous WAL archiving plus retention as described above. The total backup infrastructure cost is under $20/month across all services, and the verification discipline is on the calendar as a recurring quarterly task.
The deeper observation
Backup strategy is one of the topics where the difference between intermediate and senior engineering judgment is most visible. The intermediate engineer reaches for pg_dump and considers the problem solved. The senior engineer asks what recovery scenarios are possible, what recovery time is acceptable for each, what the costs of getting it wrong are, and how the verification will work. The honest answer is rarely the simplest one, and the value of the answer compounds with how unlikely the disaster is, because the unlikely disaster is the one nobody has rehearsed.