Postgres pg_basebackup: Building Physical Backups Without Locking Production
The tool that takes a binary copy of a running cluster while writes continue. The right primitive for point-in-time recovery and standby provisioning, but the documentation underweights the operational details that matter.
pg_basebackup ships with Postgres and serves two distinct operational needs: creating the starting point for point-in-time recovery, and provisioning a streaming replica or hot standby. The same tool serves both jobs because the underlying mechanism is the same: a file-level copy of the data directory, taken from a running primary without taking it offline, that becomes consistent once the WAL generated during the copy has been replayed.
What pg_basebackup does
pg_basebackup connects to a running Postgres instance over the replication protocol, asks it to enter backup mode, copies the data directory and every configured tablespace (skipping a short list of transient files), and asks the instance to exit backup mode. The output is a directory or set of tar files that can be used to start a new Postgres instance, either as a recovered cluster or as a standby that will continue to apply WAL from the primary.
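The mechanism that makes this work without blocking writes is non-exclusive backup mode, which pg_basebackup drives over the replication protocol (the same mode has been available to SQL-level backup scripts since Postgres 9.6). The primary performs a checkpoint and records the WAL position that marks the start of the backup, the tool copies the data directory while writes continue (so individual files may be inconsistent), and a backup_label file recording the start position is included in the backup itself rather than written into the primary's data directory. To make the copied data directory consistent, the recovering instance replays WAL from the recorded start position forward until it reaches the end-of-backup position, which is the earliest point at which the cluster is consistent.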
The WAL generated during the backup has to be captured too, either by including it in the backup via --wal-method=stream (the default since Postgres 10), or by relying on continuous WAL archiving running independently and passing --wal-method=none. Streaming is simpler and is the right default for one-shot backups; relying on the archive is the right default for production deployments where WAL is being shipped to long-term storage continuously and a base backup is one component of a point-in-time recovery setup.
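As a rough sketch of the two WAL-capture choices, the helper below builds the pg_basebackup argument list for each. The host name, user, and target path are placeholders, and build_basebackup_cmd is a hypothetical helper rather than anything shipped with Postgres; the flags themselves are standard pg_basebackup options.

```python
import shlex


def build_basebackup_cmd(host, user, target_dir, wal_method="stream"):
    """Return the argv list for a pg_basebackup run (hypothetical helper).

    wal_method is one of pg_basebackup's three values:
      "stream" - stream WAL over a second connection during the backup
      "fetch"  - collect the WAL files at the end of the backup
      "none"   - include no WAL; a WAL archive must supply it on restore
    """
    if wal_method not in ("stream", "fetch", "none"):
        raise ValueError(f"unknown WAL method: {wal_method}")
    return [
        "pg_basebackup",
        "-h", host,
        "-U", user,
        "-D", target_dir,
        f"--wal-method={wal_method}",
        "--checkpoint=fast",  # request an immediate checkpoint instead of waiting
        "--progress",
    ]


# Placeholder host, user, and path: substitute your own.
one_shot = build_basebackup_cmd("primary.example.internal", "replicator", "/backups/base")
archive_backed = build_basebackup_cmd(
    "primary.example.internal", "replicator", "/backups/base", wal_method="none"
)
print(shlex.join(one_shot))
print(shlex.join(archive_backed))
```

A backup taken with the none method is only restorable if the archive really does contain every WAL segment from the start of the backup onward, which is one more reason to rehearse restores.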
The replica-provisioning use case
The most common use of pg_basebackup is provisioning a new streaming replica. The pattern is: stop the replica and make sure its data directory is empty, run pg_basebackup pointing at the primary, configure the replica's primary_conninfo to point at the primary, start the replica, and watch as it catches up via streaming replication. With the --write-recovery-conf option (-R) the tool handles the configuration step itself: on Postgres 12 and later it creates the standby.signal file and appends a primary_conninfo entry to postgresql.auto.conf, while older versions get a recovery.conf instead. That removes one of the manual steps that used to bite operators.
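The provisioning step can be sketched as follows. This is an illustration under assumptions, not a finished script: the function names, host, user, and path are all hypothetical, and the pre-flight check simply mirrors pg_basebackup's requirement that the target directory be missing or empty.

```python
import os


def target_is_usable(path):
    """pg_basebackup wants a target directory that is missing or empty."""
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


def replica_basebackup_cmd(primary_host, user, data_dir):
    """Build the argv for seeding a standby (hypothetical helper)."""
    return [
        "pg_basebackup",
        "-h", primary_host,
        "-U", user,
        "-D", data_dir,
        "-X", "stream",  # carry the WAL needed for consistency with the copy
        "-R",            # --write-recovery-conf: set up standby configuration
        "-P",            # progress reporting
    ]


# Placeholder values; on a real host the path would be the replica's data directory.
data_dir = "/var/lib/postgresql/replica-data"
if target_is_usable(data_dir):
    print(" ".join(replica_basebackup_cmd("primary.example.internal", "replicator", data_dir)))
else:
    print(f"{data_dir} is not empty; refusing to build a backup command")
```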
The bottleneck for large clusters is network and disk throughput. pg_basebackup copies data over a single connection and has no parallel-workers option, which limits throughput to roughly what one core and one TCP connection can sustain. For multi-terabyte clusters this can mean overnight backup windows. The limit applies at every Postgres version and in both directory and tar format; clusters that have outgrown a single stream typically move to a third-party tool such as pgBackRest, which can run several backup processes in parallel.
The streaming-replica setup also depends on the primary retaining enough WAL for the replica to catch up. WAL generated during the backup is covered when it is streamed alongside the copy, but if the primary recycles segments the replica still needs before the replica has connected and started streaming, the replica cannot catch up unless the missing WAL can be fetched from an archive. The mitigation is either wal_keep_size (wal_keep_segments before Postgres 13) set generously, or a replication slot created on the primary before the backup starts; pg_basebackup can create one itself with --create-slot and --slot. A replication slot is the cleaner answer but introduces its own failure mode: a forgotten slot will pin WAL retention indefinitely and eventually fill the disk, unless max_slot_wal_keep_size (Postgres 13+) caps it.
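One way to catch a forgotten slot before it fills the disk is to measure how much WAL each slot is holding back. The sketch below assumes the rows come from the query shown in SLOT_QUERY (run by whatever database client you use); the helper names and the 50 GiB threshold are illustrative assumptions. The arithmetic relies on the pg_lsn text format, two hexadecimal numbers giving the high and low 32 bits of a 64-bit position.

```python
# Query to run on the primary (Postgres 10+ function and column names).
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def lsn_to_int(lsn):
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the primary must keep on behalf of a slot."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def looks_forgotten(active, current_lsn, restart_lsn, limit_bytes=50 * 1024**3):
    """Flag an inactive slot holding back more than limit_bytes (assumed threshold)."""
    return (not active) and retained_wal_bytes(current_lsn, restart_lsn) > limit_bytes


# Example row values, made up for illustration.
print(retained_wal_bytes("1/0", "0/FF000000"))
```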
The point-in-time recovery use case
The other primary use is creating the base for point-in-time recovery. The pattern is: take a base backup, archive WAL continuously to long-term storage (via archive_command), and on disaster, restore the base backup, configure restore_command to fetch archived WAL, set a recovery_target_time, create a recovery.signal file (Postgres 12+; earlier versions use recovery.conf), and start the cluster. The cluster will replay WAL up to the target time and then pause or promote, depending on recovery_target_action.
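The restore-side configuration might look like the sketch below, which assumes Postgres 12 or later, a WAL archive reachable as a local directory, and a plain cp as the restore command. configure_pitr and all paths are hypothetical; a real setup would usually fetch from object storage and may prefer to put these settings in postgresql.conf.

```python
import os
import tempfile


def configure_pitr(data_dir, archive_dir, target_time):
    """Write recovery settings into a restored data directory (Postgres 12+ layout)."""
    settings = [
        # %f is the WAL file name wanted, %p the path to copy it to.
        f"restore_command = 'cp {archive_dir}/%f %p'",
        f"recovery_target_time = '{target_time}'",
        # Accept writes once the target is reached instead of pausing.
        "recovery_target_action = 'promote'",
    ]
    conf_path = os.path.join(data_dir, "postgresql.auto.conf")
    with open(conf_path, "a", encoding="utf-8") as conf:
        conf.write("\n".join(settings) + "\n")
    # An empty recovery.signal tells the server to start in targeted recovery.
    with open(os.path.join(data_dir, "recovery.signal"), "w", encoding="utf-8"):
        pass
    return settings


# Demonstrate against a throwaway directory standing in for the restored backup.
with tempfile.TemporaryDirectory() as restored:
    for line in configure_pitr(restored, "/mnt/wal-archive", "2024-01-01 00:00:00+00"):
        print(line)
    print(sorted(os.listdir(restored)))
```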
The base backup frequency is a trade-off between storage cost and recovery time. A base backup taken weekly with WAL archived continuously means a worst-case recovery has to replay roughly a week of WAL, which can take hours for write-heavy clusters. A nightly base backup reduces replay time but increases storage. The right answer depends on your RTO and storage budget.
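The trade-off is easy to put rough numbers on. Every figure in the sketch below is an assumption chosen for illustration (a 500 GB cluster, 50 GB of WAL per day, replay at 100 GB per hour, 28 days of retention); measure your own WAL volume and replay rate before relying on the result.

```python
import math


def worst_case_replay_hours(wal_gb_per_day, days_between_backups, replay_gb_per_hour):
    """Hours to replay all WAL written since the previous base backup."""
    return wal_gb_per_day * days_between_backups / replay_gb_per_hour


def base_backup_storage_gb(cluster_gb, retention_days, days_between_backups):
    """Space taken by the base backups kept within the retention window."""
    return cluster_gb * math.ceil(retention_days / days_between_backups)


for label, interval_days in (("weekly", 7), ("nightly", 1)):
    hours = worst_case_replay_hours(50, interval_days, 100)
    storage = base_backup_storage_gb(500, 28, interval_days)
    print(f"{label}: up to {hours} h of replay, {storage} GB of base backups")
```

Under these assumed numbers, moving from weekly to nightly backups cuts worst-case replay from 3.5 hours to half an hour and multiplies base backup storage by seven.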
The verification discipline is non-optional: a base backup is only as good as the most recent restore drill, and restore drills should happen on a schedule, not in response to disasters. The drill is: take a recent base backup and recent WAL, restore to a separate environment, run smoke tests against the restored cluster. Backups that have never been restored have a high probability of being broken in ways that only restoration reveals.
The operational details
Three details that the documentation underweights:
The --max-rate option throttles the transfer of data files to avoid saturating network bandwidth or disk I/O on the primary. On a busy production primary, an unthrottled base backup can produce measurable application latency impact. A reasonable starting point is half the available bandwidth, which roughly doubles backup duration but keeps the primary responsive. The value is given in kilobytes per second unless suffixed, and the accepted range runs from 32 kB/s to 1024 MB/s.
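Turning a bandwidth budget into a --max-rate value is simple arithmetic. The sketch below is a hypothetical helper that assumes the link speed is quoted in megabits per second and that a kilobyte means 1024 bytes; it clamps the result to the range pg_basebackup accepts (32 kB/s to 1024 MB/s).

```python
MIN_RATE_KB = 32           # lowest rate pg_basebackup accepts
MAX_RATE_KB = 1024 * 1024  # 1024 MB/s expressed in kB/s


def max_rate_flag(link_mbit_per_s, fraction=0.5):
    """Return a --max-rate flag using the given fraction of the link."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    bytes_per_s = link_mbit_per_s * 1_000_000 / 8
    rate_kb = int(bytes_per_s * fraction / 1024)
    rate_kb = max(MIN_RATE_KB, min(rate_kb, MAX_RATE_KB))
    return f"--max-rate={rate_kb}"


# Half of a 1 Gbit/s link.
print(max_rate_flag(1000))
```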
The -X stream method opens a second connection to the primary to stream WAL during the backup. This means the primary needs at least two replication-protocol connections available, which interacts with the max_wal_senders setting. If max_wal_senders has no headroom, the second connection is refused and the backup fails. By default the WAL stream also uses a temporary replication slot, so max_replication_slots needs a free entry too, and a shortage there surfaces as an error about replication slots rather than connections.
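A pre-flight check for sender headroom can be as small as the sketch below. It assumes you have already fetched the two numbers from the primary with the queries shown (using whatever client you prefer); the function name is hypothetical.

```python
# Queries to run on the primary before starting the backup.
MAX_SENDERS_QUERY = "SHOW max_wal_senders"
SENDERS_IN_USE_QUERY = "SELECT count(*) FROM pg_stat_replication"


def has_sender_headroom(max_wal_senders, senders_in_use, wal_method="stream"):
    """True if enough WAL sender connections are free for the backup.

    The stream method needs two connections (data plus WAL);
    fetch and none need only one.
    """
    needed = 2 if wal_method == "stream" else 1
    return max_wal_senders - senders_in_use >= needed


# Example: 10 senders configured, 9 already used by replicas.
print(has_sender_headroom(10, 9))
print(has_sender_headroom(10, 9, wal_method="fetch"))
```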
The backup is a byte-for-byte copy of the data files, so its size tracks the on-disk size of the cluster rather than the size of the live data. Temporary spill files (anything named pgsql_tmp) and a handful of transient directories are skipped, so work_mem-driven disk spill at backup time does not inflate the result. What does get copied is everything else, including table and index bloat, and the storage cost can be surprising for teams used to the size of a pg_dump.
What pg_basebackup does not do
pg_basebackup is not a logical backup tool. It produces a physical copy of the cluster files, which means the restored cluster has to be the exact same major version of Postgres as the source. Cross-version restores require pg_dump and pg_restore (or pg_upgrade for in-place upgrades).
pg_basebackup is also not by itself a complete backup solution. Without continuous WAL archiving, a base backup represents only the state at the moment of backup completion; any writes since then are lost. The combination of base backup plus WAL archive is what enables point-in-time recovery; either alone is insufficient.
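And pg_basebackup does not handle external dependencies. Tablespaces on separate volumes are not in this category: they are included in the backup automatically, and --tablespace-mapping relocates them when the backup is written in plain format. Anything the cluster depends on outside the data directory and its tablespaces has to be backed up or reinstalled separately, such as configuration files kept elsewhere by the distribution's packaging and the shared libraries of installed extensions. Most extensions store their state inside the data directory, so it travels with the backup; that includes leftovers such as the working objects of an interrupted pg_repack run, which are copied as-is and need cleanup after restore.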
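When a backup is restored onto a host with a different volume layout, the tablespace paths have to be remapped. The sketch below builds the --tablespace-mapping arguments from a dictionary; the helper name and the example paths are hypothetical. The option requires absolute paths on both sides and a backslash before any equals sign inside a path.

```python
# Lists tablespaces and their on-disk locations on the source cluster.
TABLESPACE_QUERY = "SELECT spcname, pg_tablespace_location(oid) FROM pg_tablespace"


def tablespace_mapping_args(mappings):
    """Turn {old_dir: new_dir} into pg_basebackup --tablespace-mapping flags."""
    args = []
    for old_dir, new_dir in mappings.items():
        if not (old_dir.startswith("/") and new_dir.startswith("/")):
            raise ValueError("tablespace paths must be absolute")
        # An '=' inside a path has to be escaped with a backslash.
        old_escaped = old_dir.replace("=", "\\=")
        new_escaped = new_dir.replace("=", "\\=")
        args.append(f"--tablespace-mapping={old_escaped}={new_escaped}")
    return args


# Example paths, made up for illustration.
print(tablespace_mapping_args({"/mnt/ts_fast": "/restore/ts_fast"}))
```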
Our use across products
Our four products (DocuMint, CronPing, FlagBit, WebhookVault) currently run on SQLite, where the equivalent operation is a simple file copy with WAL checkpointing. The Postgres migration plan includes pg_basebackup as the foundation of the backup strategy, with WAL archiving to off-site storage and weekly restore drills. The drill discipline is the part that matters most: every other piece of the strategy is documented in the Postgres manual, and the part that distinguishes teams that have working backups from teams that think they have working backups is whether anyone has actually restored one recently.
Deeper observation
Tools like pg_basebackup are foundational in the sense that they are required for any serious operational story, and almost invisible in the sense that they only matter on the days when something has gone wrong. The right time to learn the operational details is when nothing is on fire. The wrong time is during an incident where every minute of recovery counts and the documentation you should have read months ago is now the only thing standing between you and explaining to customers why their data is gone.