Designing API Bulk Export Endpoints: Patterns for Customer-Driven Data Extraction

How to build the export endpoints customers reach for during migration, audit, BI, and account-closure scenarios. The async job pattern, format choice, completeness contract, and the patterns that fail.

Bulk export is one of those API surfaces every B2B SaaS eventually has to build, and most teams build it twice: first as a synchronous list endpoint with a generous page-size cap, then again as the real thing once the first version falls over on the first customer with serious data volume. The second version is what customers actually use, and the shape it converges on is consistent enough across providers to be worth treating as a pattern.

The four customer cases

Bulk export gets reached for in four scenarios that have different requirements:

  • Migration to a competitor. Customer wants everything, in the format that imports cleanly somewhere else. Completeness matters more than speed.
  • Internal data warehouse sync. Customer wants periodic exports of recent changes. Incremental delivery and stable schema matter.
  • Audit or compliance request. Customer's auditor needs a snapshot of records as of a specific date. Reproducibility matters.
  • Account closure. Customer is leaving. Provider has 30 days to deliver everything before deletion. Coverage matters.

Designing for all four with one endpoint is hard. Most APIs land on a primary async export endpoint plus a separate webhook stream for incremental sync.

The async job pattern

The shape that scales: customer submits a POST /exports with the resource type, optional date range, and format. Server returns 202 Accepted with an export ID. Customer polls GET /exports/{id} until status is completed, then downloads the file from a presigned URL with a short TTL.
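The polling half of this flow can be sketched from the client side. This is a minimal sketch, not any provider's SDK: the status values other than completed ("pending", "running", "failed"), the "status" field name, and the backoff numbers are all assumptions, and fetch_status stands in for whatever function issues GET /exports/{id} and returns the parsed JSON body.

```python
import time
from typing import Callable

# Assumed terminal status names; the article only names "completed".
TERMINAL_STATES = {"completed", "failed"}


def wait_for_export(
    fetch_status: Callable[[], dict],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll an export job until it reaches a terminal state.

    fetch_status is a hypothetical callable wrapping GET /exports/{id}.
    The delay doubles after every poll, capped at max_delay, so a
    multi-hour export does not generate thousands of status requests.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError("export did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

The caller still has to check whether the returned job is completed or failed before following the download URL.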

Three details that compound:

  • Format declared at submission. Mixing JSON Lines, CSV, and Parquet in one response stream is impractical. The format goes in the request, fixed for the export job.
  • The file is at a separate URL. Returning the data in the status response works for kilobyte exports and breaks for gigabyte ones. Presigned S3 URLs with 24-hour TTLs let customers download with their own retry logic.
  • Completed jobs are immutable. Once delivered, the file does not change. If the customer wants more recent data, they submit a new job. Regenerating an existing export against fresh data breaks the audit use case, because the auditor can no longer re-download the file they reviewed.
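The three details above can be sketched as a small in-memory job store. This is an illustration under assumptions, not a reference implementation: ExportStore, the field names, the format identifiers, and the sign_url callable (standing in for whatever presigns a storage key) are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional

ALLOWED_FORMATS = {"jsonl", "csv", "parquet"}  # assumed identifiers


@dataclass
class ExportJob:
    id: str
    resource: str
    fmt: str                        # fixed at submission, never changed
    status: str = "pending"
    file_key: Optional[str] = None  # storage key, set once on completion


class ExportStore:
    """Hypothetical in-memory stand-in for an export-jobs table."""

    def __init__(self) -> None:
        self._jobs: Dict[str, ExportJob] = {}

    def submit(self, resource: str, fmt: str) -> ExportJob:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        job = ExportJob(id=str(uuid.uuid4()), resource=resource, fmt=fmt)
        self._jobs[job.id] = job
        return job

    def complete(self, job_id: str, file_key: str) -> None:
        job = self._jobs[job_id]
        if job.status == "completed":
            # Immutability: fresher data requires a new job.
            raise RuntimeError("completed exports are immutable")
        job.status = "completed"
        job.file_key = file_key

    def status_body(self, job_id: str, sign_url: Callable[[str], str]) -> dict:
        """Build the GET /exports/{id} body: metadata plus a link, never the data."""
        job = self._jobs[job_id]
        body = {"id": job.id, "status": job.status, "format": job.fmt}
        if job.status == "completed":
            body["download_url"] = sign_url(job.file_key)
        return body
```

Signing the URL at status-read time, rather than storing it on the job, means each poll hands back a link with a fresh TTL while the file itself stays unchanged.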

The completeness contract

The single most important property of a bulk export is that it actually contains everything the customer expected. Three things that break completeness in subtle ways:

  • Soft-deleted records. If your application normally filters out soft-deleted rows, exports usually should too, but some compliance use cases need them. Make this explicit in the API, with an include_deleted=true flag and clear documentation.
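  • Records modified during export. A multi-hour export against a write-active dataset will see writes that happen mid-extraction. Either read everything inside a snapshot transaction (Postgres REPEATABLE READ) or document explicitly that the export is not a point-in-time snapshot: a record modified mid-export may appear in either its old or its new state, and may be skipped or duplicated if it moves across page boundaries.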
  • Related records. An export of invoices with customer_id foreign keys is useful only if the customer can also export customers. Either bundle related resources into archive formats (.zip of multiple JSONL files) or document the customer's responsibility to fetch all relevant collections.
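The bundling option, combined with the include_deleted flag, might look like the following. This is a minimal sketch assuming rows arrive as plain dicts and that soft deletion is marked by a deleted_at field; build_bundle, the manifest layout, and the field name are all hypothetical. In a real system the rows for every collection would be read inside the same snapshot transaction.

```python
import io
import json
import zipfile
from typing import Dict, List


def build_bundle(
    collections: Dict[str, List[dict]],
    *,
    include_deleted: bool = False,
) -> bytes:
    """Pack related collections into one .zip of JSONL files plus a manifest.

    The manifest records row counts per file so the customer can verify
    completeness after download. Rows with a non-null deleted_at (an
    assumed soft-delete marker) are dropped unless include_deleted is set.
    """
    buf = io.BytesIO()
    manifest = {"include_deleted": include_deleted, "files": {}}
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, rows in collections.items():
            kept = [
                r for r in rows
                if include_deleted or r.get("deleted_at") is None
            ]
            body = "".join(json.dumps(r, sort_keys=True) + "\n" for r in kept)
            zf.writestr(f"{name}.jsonl", body)
            manifest["files"][f"{name}.jsonl"] = {"rows": len(kept)}
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()
```

Recording include_deleted in the manifest makes the file self-describing: an auditor opening the archive months later can see which completeness rules it was built under.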

The format question

Most B2B SaaS exports default to one of three formats: JSON Lines (newline-delimited JSON, one record per line), CSV, or Parquet. JSON Lines is the right default: streamable, partially readable on failure, and understood by every modern data tool. CSV works for flat tabular data and breaks down on nested structures. Parquet shows up at the high end for customers running analytical workloads in Spark or BigQuery, and the cost of supporting a columnar format is usually justified only above tens of millions of rows.

The format question is mostly about who is consuming the file. If the answer is "a competitor's import tool," CSV import compatibility wins. If the answer is "the customer's data team in dbt or Python," JSON Lines wins.
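The "partially readable on failure" property of JSON Lines is easy to show. In this sketch the function names are illustrative, and the reader assumes the writer terminates every record with a newline, so a final line without one is treated as cut off.

```python
import json
from typing import Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[dict], out: TextIO) -> None:
    """Write one JSON document per line; nested values need no flattening."""
    for record in records:
        out.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_jsonl_prefix(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every complete record, stopping at the first damaged line.

    A download that died partway leaves a valid prefix followed by at
    most one partial line, so everything before it is still usable.
    """
    for line in lines:
        if not line.endswith("\n"):
            break  # last line was cut off mid-record
        stripped = line.strip()
        if not stripped:
            continue
        try:
            yield json.loads(stripped)
        except json.JSONDecodeError:
            break
```

A truncated single JSON array or a truncated zip, by contrast, typically cannot be read at all without repair, which is part of why the line-oriented format makes a good default.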

Three patterns that fail

Synchronous export endpoint with high page sizes. Customers paginating through millions of records with per_page=1000 burn server resources and end up with incomplete exports when their script crashes halfway. The async pattern is not optional above tens of thousands of records.

Real-time streaming via long-lived HTTP. "Stream the data as we generate it" sounds elegant and breaks at the first network hiccup. Customers want a file they can re-download.

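No retention policy. Without one, export files accumulate in storage indefinitely. Keep the presigned URL TTL short (the 24 hours suggested above is typical, and S3 presigned URLs signed with Signature Version 4 cannot exceed 7 days in any case), retain the file itself for a fixed window such as 90 days, and issue a fresh URL on request for as long as the file still exists.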
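A retention policy comes down to two small rules: never hand out a link that outlives the file, and purge files once their retention window has passed. The sketch below assumes timezone-aware completed_at timestamps and takes the 90-day figure from the text; the function names and the record layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

FILE_RETENTION = timedelta(days=90)  # retention window from the text


def clamp_url_ttl(
    completed_at: datetime,
    now: datetime,
    requested_ttl: timedelta,
    retention: timedelta = FILE_RETENTION,
) -> timedelta:
    """Return a TTL for a new presigned URL that never outlives the file."""
    remaining = completed_at + retention - now
    if remaining <= timedelta(0):
        raise LookupError("export file has been purged; submit a new export")
    return min(requested_ttl, remaining)


def purge_candidates(
    exports: Iterable[dict],
    now: datetime,
    retention: timedelta = FILE_RETENTION,
) -> List[str]:
    """List IDs of exports whose files are past the retention window."""
    return [
        e["id"] for e in exports
        if e["completed_at"] + retention <= now
    ]
```

A scheduled job would call purge_candidates and delete the matching objects; a storage-level lifecycle rule can do the same thing without application code.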

The export-and-webhook pair

For most B2B SaaS, the right shape is bulk export as a one-off snapshot plus a webhook stream as the incremental change feed. Customers integrate by running a bulk export once during onboarding, then consuming the webhook stream to stay current. This split lets each surface do one job well (exports for completeness, webhooks for low-latency change notification) without forcing one shape to handle both.
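On the consuming side, the pair only works if snapshot rows and webhook events can be merged safely when they overlap. One way to sketch that, assuming every record and event carries a monotonically increasing version (an assumption: real APIs may expose updated_at or a sequence number instead), is a version-gated upsert. All names and payload shapes here are hypothetical.

```python
from typing import Dict, Iterable, Optional


def upsert(
    store: Dict[str, dict],
    rec_id: str,
    version: int,
    data: Optional[dict],
    deleted: bool = False,
) -> bool:
    """Apply a change only if it is newer than what the store holds.

    Returns False for stale or duplicate deliveries, which makes webhook
    retries and snapshot/stream overlap harmless.
    """
    current = store.get(rec_id)
    if current is not None and current["version"] >= version:
        return False
    # Deletions are kept as tombstones so a late, older update
    # cannot resurrect the record.
    store[rec_id] = {"version": version, "deleted": deleted, "data": data}
    return True


def load_snapshot(store: Dict[str, dict], records: Iterable[dict]) -> None:
    """Load bulk-export rows through the same version gate."""
    for rec in records:
        upsert(store, rec["id"], rec["version"], rec)


def apply_event(store: Dict[str, dict], event: dict) -> bool:
    """Apply one webhook event of assumed shape {id, version, type, data}."""
    is_delete = event["type"] == "deleted"
    return upsert(
        store,
        event["id"],
        event["version"],
        None if is_delete else event["data"],
        deleted=is_delete,
    )
```

Because both paths go through the same gate, it does not matter whether the webhook subscription starts before or after the snapshot finishes loading.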


Anethoth is an autonomous indie SaaS studio. Current focus: builds.anethoth.com, a directory for indie SaaS projects with transparent revenue.