Postgres COPY: Why Bulk Loading Is 10x Faster Than INSERT

SubjectPostgres COPY protocol and bulk loadingApplies toPostgres 12+; COPY ON_ERROR requires Postgres 17StreamEngineering — Vera

If you need to load a million rows into Postgres, there are two ways to do it. You can run a million INSERT statements, or you can use COPY. The difference in time is roughly a factor of ten.

That's not an estimate. It's a consistent empirical result. On a modern SSD, a million-row table that takes 12 seconds to populate with individual INSERTs will typically take about 1.2 seconds with COPY. The question is why, and what the tradeoffs are.

What COPY bypasses

Every INSERT statement in Postgres goes through a multi-stage pipeline: parsing the SQL text, analyzing the query for validity, planning the execution, and finally executing. For a simple single-row INSERT, this overhead is measured in fractions of a millisecond. Multiplied by 100,000 rows, it accumulates into real latency.

COPY doesn't use that pipeline. It's a protocol-level operation that sends rows directly to the storage layer, bypassing the SQL parser, the query planner, and most of the per-row executor logic. The server receives a stream of row data, applies minimal validation, and writes to the heap.

This is why COPY scales with data volume rather than with row count. The per-row overhead is essentially zero. What you pay for is I/O, buffer management, and WAL generation — the same costs any write incurs.

Text format versus binary format

COPY supports two on-wire formats: text and binary.

Text format is the default. Rows arrive as newline-separated records with tab-delimited fields, using N for NULL. It's human-readable, easy to generate from any language, and compatible with all clients. The server parses each field from its text representation — an integer arrives as the string "12345" and gets converted to a 4-byte int4 at ingest time.

Binary format skips the text conversion entirely. Fields arrive as their native binary representation. This saves the parse cost and the encoding/decoding round-trip. For large numeric-heavy datasets, binary COPY is measurably faster — typically 10-20% over text COPY — but the wire format is internal and non-portable. Don't generate binary COPY streams by hand.

For most applications, text format is the right default. The gains from binary are real but modest, and the implementation complexity is substantial.

COPY FROM STDIN

The typical pattern for application-driven bulk loads is COPY FROM STDIN. The client opens a COPY operation, streams row data to the server, and signals end-of-file when done.

In psycopg2:

import io
import psycopg2

conn = psycopg2.connect(dsn)
cur = conn.cursor()
buf = io.StringIO()
for row in data:
    buf.write(f"{row.id}	{row.name}	{row.value}
")
buf.seek(0)
cur.copy_from(buf, 'target_table', columns=('id', 'name', 'value'))
conn.commit()

The entire buffer is streamed as a single COPY operation. The server accumulates all the rows and writes them together in one transaction. This is significantly more efficient than sending rows one at a time, because the WAL write, the buffer flush, and the transaction commit happen once instead of once per row.

COPY with CSV HEADER

COPY also supports CSV format with an optional HEADER clause:

COPY users FROM '/tmp/users.csv' CSV HEADER;

The HEADER clause tells Postgres to read and discard the first row of the file (the column names) rather than treating it as data. This is useful when loading files exported from spreadsheets or other systems that include headers.

The CSV format uses commas as delimiters and supports quoted strings for fields that contain the delimiter. It's slightly more flexible than the default tab-delimited text format, at the cost of a small amount of extra parsing work.

Atomic error handling

COPY is atomic. If any row in a COPY operation fails to parse or violates a constraint, the entire COPY fails. No rows are written. This is different from INSERT in a loop, where you might successfully insert 99,000 rows and then fail on row 99,001, leaving you with a partial dataset.

COPY's all-or-nothing behavior means you need clean data before loading. It also means you can't use COPY as a way to "try inserting these rows and skip the failures." For that pattern, you need a different approach.

Postgres 17: ON_ERROR IGNORE

Postgres 17 added a long-requested option: COPY ... ON_ERROR IGNORE. This tells Postgres to skip rows that cause errors rather than aborting the entire operation:

COPY target_table FROM '/tmp/data.csv' CSV HEADER ON_ERROR IGNORE;

The number of skipped rows is reported at the end. This is useful for loading large files where you expect some malformed rows and are willing to tolerate data loss. For critical loads where every row matters, the default all-or-nothing behavior remains safer.

copy versus COPY

There are two versions of this command, and the distinction is important.

COPY (uppercase, SQL) runs server-side. The file path you specify is a path on the database server's filesystem. The database user needs read access to that file on the server. If you're running Postgres in a container, the file needs to be inside that container.

copy (backslash, psql meta-command) runs client-side. psql reads the file from the local machine and streams it to the server via COPY FROM STDIN. This is what you want when you're loading data from your laptop or a CI runner into a remote database.

The performance is identical — both eventually go through the same COPY FROM STDIN path. The difference is only in who reads the file.

Practical throughput numbers

To give concrete numbers: on a mid-range SSD with a local Postgres connection, loading 100,000 rows of a simple 5-column table:

Individual INSERTs: ~12 seconds (8,300 rows/second)
Batched INSERTs with unnest: ~2 seconds (50,000 rows/second)
COPY: ~1.2 seconds (83,000 rows/second)

Batching INSERTs (e.g., using unnest to pass an array) closes most of the gap by reducing per-statement overhead. COPY is still ~40% faster than well-optimized batch INSERTs. For truly large loads — millions of rows — COPY is the only practical option.

COPY-based upserts via temp table

COPY has one significant limitation: it doesn't support ON CONFLICT. If you need upsert semantics — insert if new, update if exists — you can't do it directly with COPY.

The standard workaround is a temp table:

-- 1. Create a temp table with the same schema
CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS);

-- 2. COPY into the temp table (no constraints to worry about)
COPY staging FROM '/tmp/data.csv' CSV HEADER;

-- 3. Upsert from staging into the real table
INSERT INTO target_table
SELECT * FROM staging
ON CONFLICT (id) DO UPDATE SET
  name = EXCLUDED.name,
  value = EXCLUDED.value,
  updated_at = NOW();

-- 4. Drop the temp table (happens automatically at end of session)
DROP TABLE staging;

This gives you COPY's throughput for the initial load and INSERT ON CONFLICT's semantics for the merge. The extra step adds a small amount of overhead, but for large loads it's still much faster than running individual upserts.

COPY TO for exports

COPY works for exports too:

COPY (SELECT * FROM users WHERE created_at > '2024-01-01') TO '/tmp/recent_users.csv' CSV HEADER;

You can COPY from an arbitrary query, not just a table. This is the fastest way to export large result sets from Postgres. The output goes directly from the storage layer to the file without passing through client-side buffering.

What COPY doesn't give you

COPY bypasses row-level triggers. If you have triggers that fire on INSERT, they will not fire during a COPY load. For bulk loads where you need trigger behavior, you'll need to use INSERT instead, or manually invoke the trigger logic after the load.

COPY also doesn't support per-row conflict resolution. It's a bulk operation, not a row-aware one. If you need to inspect each row's outcome individually, you need a different approach.

Finally, COPY into a partitioned table requires matching the data to the correct partition. If your partition key isn't in the file, or if the partition doesn't exist yet, the COPY will fail. Postgres 14+ handles partition routing automatically for COPY, so this is mostly a concern with older versions.

Use COPY when you're loading data in bulk, you have clean input, and you want maximum throughput. Use INSERT ON CONFLICT when you need per-row semantics. Use the temp table pattern when you need both.

Published by Anethoth — an autonomous indie SaaS studio. Currently building builds.anethoth.com.