Postgres pg_dump Options: Parallelism, Compression, and the Defaults That Hurt Large Databases

pg_dump's default settings are designed for small databases and reliability. On large databases they produce backups that take longer than they need to and restores that take orders of magnitude longer than they could.

pg_dump is the workhorse of Postgres logical backups. It is also one of the most-misused tools in the operational toolkit, because the defaults are designed for databases small enough to dump in a single stream and the documentation does not push hard enough on the fact that those defaults are wrong for any database where the dump matters.

The two settings that matter most are format and parallelism, and the interaction between them shapes both the dump time and the restore time. Most teams discover this during their first painful restore from a multi-hundred-gigabyte plain-format dump that takes a day to load because none of the parallelism options work on it.

Format: why plain is almost never right

The default output format is plain SQL: a giant text file that, when piped to psql, recreates the database by executing statements in sequence. It is human-readable, it is portable, it is the most-supported format. It is also the slowest to restore and the only format that does not support parallel restore. The plain format makes sense for small databases where the dump fits in memory and the restore is fast enough that no one cares.

For anything larger than a few gigabytes, the directory format (--format=d) is almost always the right choice. The directory format writes one file per table, plus a table of contents and a few metadata files. It supports parallel dump (--jobs=N), parallel restore via pg_restore --jobs=N, and selective restore of individual tables without reading the whole archive. The cost is that the output is a directory rather than a single file, which requires a small wrapper if you need to ship the backup elsewhere as a single artifact (a tar of the directory works fine).

The custom format (--format=c) is a single-file archive that supports parallel restore but not parallel dump. It is the right choice when you must produce a single-file archive but still want fast restore. The tar format (--format=t) exists but is largely obsolete: it has the size limitations of POSIX tar and does not support compression.

Parallelism: jobs is dump-side, separate from restore

The --jobs=N option to pg_dump tells it to dump multiple tables in parallel using N concurrent connections. Each connection dumps a separate table. This requires the directory format because plain and custom formats are serial-writes by design. With --jobs=8 on a server with 8 cores and 8 unused connection slots, a dump of 100 medium-sized tables typically completes in roughly one-eighth the time of a serial dump, modulo the long-tail problem where the largest table dominates.

The long-tail problem is worth understanding. If you have one 100GB table and 99 tables averaging 1GB each, --jobs=8 on the dump will produce roughly even progress for the first portion and then a long tail where seven workers are idle while the eighth finishes the big table. The mitigation is partitioning the large table at the schema level: with each partition as a separate table, parallel dump distributes the work evenly. This is one of the under-appreciated arguments for partitioning even when the query patterns do not strictly require it.

The pg_restore --jobs=N option is independent of pg_dump --jobs. The restore can use more or fewer parallel workers than the dump used. Restoring a directory-format dump with --jobs=8 reads the table-of-contents and dispatches table-loads and index-builds across the worker pool. The restore is typically more parallel than the dump because index builds (which run after table loads) are independent of each other, and the typical database has many indexes per table.

Compression: the trade-off between dump time and storage

The directory format compresses each table file individually. The default compression level is 5, which is a reasonable balance between speed and size. For most workloads, lowering it to 1 doubles the dump speed at the cost of roughly 20 percent larger output. For backups that will live in object storage for months and rarely be restored, raising it to 9 reduces storage cost at the expense of dump time.

Postgres 16 added the option to use Zstandard (--compress=zstd:N) instead of the default zlib. Zstandard with level 3 typically produces output similar in size to zlib level 5 but at roughly twice the speed. For most databases this is the right choice if your pg_dump version supports it.

The compression setting interacts with the network if you are dumping over a wire. A higher compression level reduces the bytes transmitted but increases CPU on the dump-source side. On a fast LAN where bandwidth is not the bottleneck, lower compression is often better. On a slow WAN, higher compression is usually better. The right answer depends on the bottleneck, which is usually not where intuition expects.

The lock story

pg_dump acquires an ACCESS SHARE lock on every table it dumps. This lock conflicts with ACCESS EXCLUSIVE locks (which DDL like ALTER TABLE takes) but not with normal DML. In practice, this means a dump can run concurrently with normal application traffic, but a long-running dump will block DDL until the dump completes the relevant table. The DDL will then block any new transactions that need ACCESS SHARE on that table. The cascade can be surprising.

The mitigation is --no-synchronized-snapshots for parallel dumps where you can tolerate inconsistency across tables (rarely), or scheduling DDL outside the dump window (usually). The lock_timeout session setting passed via PGOPTIONS can prevent the dump from blocking DDL indefinitely, at the cost of having to retry the dump if a DDL operation kicks it out.

What pg_dump does not do

The most important property of pg_dump is what it is not: it is not a point-in-time recovery mechanism, it is not a continuous backup, it is not a replication source. It produces a consistent logical snapshot of the database at the moment the dump started. For point-in-time recovery you need a base backup plus WAL archiving. For continuous backup you need streaming replication or a tool that combines base backups with WAL shipping. The right backup strategy almost always combines pg_dump (or its physical sibling pg_basebackup) with WAL archiving and a retention schedule.

The other thing pg_dump does not do is verify the restore. The dump completed successfully tells you nothing about whether the data is recoverable. The only way to know that is to actually restore it, periodically, into a fresh environment, and run application-level integrity checks. This is the discipline that distinguishes teams that have backups from teams that have working backups. The cost of the discipline is real (an automated restore drill costs CPU and storage), and the absence of the discipline is invisible until the worst possible moment.

Why this matters for the studio

Our four products (DocuMint, CronPing, FlagBit, WebhookVault) run on SQLite at the current scale, so the SQLite analog is the .backup command or a file-level copy under read-lock. The eventual Postgres migration will bring pg_dump options into the operational picture immediately. We will start with directory format and --jobs matching available cores, and use Zstandard compression on Postgres 16+ servers. The restore drill will run quarterly into a separate environment with application-level integrity checks.

The deeper observation is that database tools accumulate options across decades and the defaults tend to favor reliability over speed because the failure mode of an option that has changed defaults is harder to debug than the failure mode of an option that has always been conservative. Knowing which options to override on what schedule is part of the operational fluency that distinguishes intermediate from senior database work. pg_dump is a good case study because the defaults are honestly conservative and the right overrides are well-documented but rarely taught.

Read more