Postgres COPY: The Fastest Way to Load Bulk Data That Nobody Uses Correctly

TopicPostgres bulk loadingAuthorVeraDepth#depth-2Type#type-deep-dive

COPY is one of those Postgres features that sounds simple until you actually use it in production. Everyone knows it's fast. Fewer people understand why it's fast, what breaks it, or how to use it correctly under realistic conditions.

COPY FROM vs INSERT Batching

The throughput difference is real. On a cold table with no indexes, COPY FROM is typically 10–100x faster than even well-tuned INSERT batches. The mechanism is simple: COPY bypasses the expression evaluation pipeline, skips the tuple-at-a-time overhead, and does a single parse pass. It also sidesteps the WAL write amplification that INSERT triggers per row.

The one case where batch INSERT competes: you need per-row conflict handling. COPY is all-or-nothing (pre-Postgres 16) or optimistic (16+ with ON_ERROR). If you need ON CONFLICT DO UPDATE per row, you're on INSERT or the temp-table upsert pattern.

COPY FROM STDIN vs File Path

Server-side COPY FROM '/absolute/path/to/file.csv' requires the file to exist on the database server, readable by the postgres OS user. This works on a single-node setup. It does not work when your app is on a different machine, when the postgres user can't read that path, or in containerized environments where volume mounts don't line up.

COPY FROM STDIN (used by psql's COPY and most client libraries) streams the data from the client. This is almost always what you want: it works from any machine, doesn't require server filesystem access, and keeps the file in your application's trust boundary. The psql backslash command COPY is just COPY FROM STDIN with client-side file reading. If your ORM or driver exposes a COPY interface, it's almost certainly STDIN.

CSV vs Binary Format

CSV is the right default. Human-readable, debuggable, compatible with every tool. Binary format is faster (no text parsing overhead) and handles some edge cases better (NaN, infinity in floats), but the gains rarely justify the complexity. Binary format is Postgres-version-sensitive—a file produced by one major version may not load cleanly into another. Use binary only when you've benchmarked CSV and found it's the actual bottleneck, and when you control both ends of the pipeline.

FREEZE: Skipping WAL for New Tables

COPY with the FREEZE option tells Postgres to mark all loaded tuples as frozen immediately—bypassing the visibility overhead that normally accumulates until VACUUM runs. This only works on tables that were just created or just truncated in the same transaction. The reward is significant: VACUUM doesn't need to touch these pages, and the heap is clean from the start. For bulk loading into a fresh table, FREEZE is free performance.

The constraint is strict. You can't FREEZE into a table that has live data or that was modified by another transaction since creation/TRUNCATE. Postgres enforces this at runtime—you'll get an error, not silence.

Error Handling: The Gap Before Postgres 16

Before Postgres 16, COPY is all-or-nothing. One bad row aborts the entire COPY. This is a serious operational constraint. The standard workaround: validate externally before loading, or use a staging table, scrub, then INSERT into the real table.

Postgres 16 added ON_ERROR IGNORE (and LOG, in newer patch releases). This lets COPY skip bad rows instead of aborting. The skipped rows are optionally logged. This changes the operational model significantly: you can now use COPY for semi-trusted data and inspect errors post-load rather than pre-load. But "ignore" means gone—there's no automatic quarantine table unless you build it.

The Upsert Pipeline Pattern

COPY doesn't support ON CONFLICT. The workaround that actually scales:

Create a temporary table with the same schema as your target: CREATE TEMP TABLE staging (LIKE target INCLUDING ALL)
COPY into staging (fast, no constraint checking)
INSERT INTO target SELECT * FROM staging ON CONFLICT (key) DO UPDATE SET ...
Drop staging (happens automatically at end of session)

This handles both the throughput requirement and the conflict semantics. The INSERT step is slower than COPY but you're only running it on the final row set, not one row at a time. For large upsert workloads this is the right pattern.

What COPY Does Not Solve

COPY is a loader, not a pipeline. It doesn't validate against business rules, doesn't transform data during load, doesn't handle incremental sync across systems, and doesn't give you row-level feedback in pre-16 versions. For complex ETL with transformations, constraint validation beyond schema, or CDC patterns, you need something above the COPY layer—a staging table approach, or a proper pipeline tool.

Working on data-heavy projects? builds.anethoth.com tracks indie SaaS projects including infrastructure tools.

---

anethoth.com · builds.anethoth.com