Designing API Bulk Delete Endpoints: Patterns for Irreversible Operations at Scale

Most APIs make single-item delete safe and bulk delete an afterthought. The patterns that survive customer use require thinking about partial failure, idempotency, and the impossibility of undo.

Bulk delete is one of the most dangerous endpoints in any API. A single request can remove thousands of records in a way that is hard or impossible to reverse, and the failure modes are different from any other bulk operation: a partial failure leaves the customer in an unknown state, an accidental wildcard can wipe out data the customer cared about, and the audit trail is critical because the action is otherwise opaque. Most APIs we have looked at treat bulk delete as an afterthought, building it as a convenience wrapper around single-item delete without thinking through the unique problems that bulk irreversible operations create.

The single-item delete story

Single-item delete is well understood. The endpoint takes a resource ID, the server deletes the resource, and the response confirms what was deleted. The semantics are typically 204 No Content on success, 404 on not-found, and 409 if the resource is in a state that cannot be deleted (still referenced, still active). Idempotency is keyed on the resource ID: the same delete can be retried safely because the second call returns 404 or 204, depending on implementation, and the resource is gone either way.

The bulk version of this is not just a list of single deletes. The composition of single-item semantics under partial failure produces a different operation with different requirements. A customer who calls bulk delete with 100 IDs and gets back "we deleted 87 of them" needs to know which 87, because retrying the other 13 requires knowing exactly which ones failed and why.

The minimum viable bulk delete

The simplest bulk delete that works in production accepts a list of IDs, processes them one at a time, and returns a per-item result. The schema is:

POST /api/v1/widgets/bulk_delete
{
  "ids": ["widget_abc", "widget_def", "widget_xyz"],
  "idempotency_key": "customer_supplied_uuid"
}

200 OK
{
  "deleted": [
    {"id": "widget_abc", "status": "deleted"},
    {"id": "widget_def", "status": "deleted"},
    {"id": "widget_xyz", "status": "not_found"}
  ],
  "summary": {
    "requested": 3,
    "deleted": 2,
    "not_found": 1,
    "failed": 0
  }
}

The response status is 200 even when some items failed, because the bulk operation itself succeeded; the per-item status communicates the granular outcome. Returning 207 Multi-Status (defined for WebDAV in RFC 4918) is arguably more precise, but it is unfamiliar to most HTTP clients and is often handled incorrectly by middleware, so 200 with structured per-item results is more practical.

The summary block is for customers writing dashboards or alert logic that needs aggregate numbers without parsing the per-item array. Include it from the start; adding it later is a backward-compatible change but customers who built parsers around the per-item array now have to update them to get the summary.
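As a rough illustration of the response shape above, here is a minimal Python sketch that processes IDs one at a time and builds both the per-item list and the summary. The in-memory `store` dict, the `delete_one` helper, and the `locked` flag are hypothetical stand-ins for a real table and its constraints, not part of any of the APIs described here.

```python
from collections import Counter


def delete_one(store, widget_id):
    """Delete a single widget and return its per-item status string."""
    if widget_id not in store:
        return "not_found"
    if store[widget_id].get("locked"):
        # Hypothetical undeletable state; the single-item endpoint would return 409.
        return "failed"
    del store[widget_id]
    return "deleted"


def bulk_delete(store, ids):
    """Process IDs one at a time and build the per-item list plus the summary."""
    items = [{"id": i, "status": delete_one(store, i)} for i in ids]
    counts = Counter(item["status"] for item in items)
    return {
        "deleted": items,  # key name follows the example response above
        "summary": {
            "requested": len(ids),
            "deleted": counts["deleted"],
            "not_found": counts["not_found"],
            "failed": counts["failed"],
        },
    }
```

Because the summary is derived from the per-item list, the two can never disagree, which is the property dashboard and alerting code relies on.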

Per-batch vs per-item idempotency

The idempotency story is more subtle than for create operations. For bulk delete, the right granularity is usually per-batch: the entire batch with a given idempotency key returns the same response whether called once or many times. Returning the cached response on retry, including the per-item statuses, lets the customer's retry logic make progress without needing to remember which items succeeded.

The alternative of per-item idempotency (idempotent by (resource_id, batch_id) pairs) is more complex to implement and rarely earns its cost. The bulk delete is a transient operation; if it failed partially, the customer can retry with the IDs that failed, and per-batch idempotency handles the case where the original response was lost in transit.

The schema for per-batch idempotency is a table keyed on (account_id, idempotency_key) storing the full response. The TTL should be at least 24 hours to cover network partitions, and preferably 72 hours to cover incidents that span a weekend. The response cache should include the HTTP status code and headers, not just the body, because customers' retry logic may key on status code.
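A minimal sketch of that cache, assuming an in-memory dict in place of the real table. The `IdempotencyCache` class and its `run` method are illustrative names only. A production version would also need a unique constraint on the key and some handling for a retry that arrives while the first request is still in flight, both of which this sketch leaves out.

```python
import time

TTL_SECONDS = 72 * 3600  # upper end of the retention range discussed above


class IdempotencyCache:
    """Stores the full response per (account_id, idempotency_key)."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        # (account_id, key) -> (expires_at, status, headers, body)
        self._rows = {}

    def run(self, account_id, key, operation):
        """Return the cached response if one is live, else run and cache it."""
        now = self._clock()
        row = self._rows.get((account_id, key))
        if row is not None and row[0] > now:
            return row[1:]  # replay status, headers, and body unchanged
        status, headers, body = operation()
        self._rows[(account_id, key)] = (now + self._ttl, status, headers, body)
        return status, headers, body
```

Keying on the account as well as the customer-supplied key keeps two customers who happen to choose the same key from seeing each other's responses.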

The wildcard problem

Most bulk delete bugs we have investigated involve someone passing the wrong filter and deleting more than they intended. The defensive design is to refuse wildcards by default and require explicit confirmation for large deletes.

The minimum protection is a count threshold: if the bulk delete would affect more than N records (we use 100 as default, configurable per account), require a confirmation parameter that the customer has to set explicitly. The confirmation can be as simple as ?confirm=true or as elaborate as a server-generated token from a preview endpoint, depending on how cautious you want to be.
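The simple `confirm=true` variant of the threshold check might look like the following sketch. `ConfirmationRequired` and `check_confirmation` are hypothetical names, and the server-generated preview token is not shown.

```python
DEFAULT_THRESHOLD = 100  # default from the text; assumed to be configurable per account


class ConfirmationRequired(Exception):
    """Raised when a large delete arrives without explicit confirmation."""

    def __init__(self, affected, threshold):
        super().__init__(
            f"{affected} records exceeds threshold {threshold}; resend with confirm=true"
        )
        self.affected = affected
        self.threshold = threshold


def check_confirmation(affected, confirm=False, threshold=DEFAULT_THRESHOLD):
    """Let small deletes through; require confirm=True above the threshold."""
    if affected > threshold and not confirm:
        raise ConfirmationRequired(affected, threshold)
```

Running this check before any row is touched means a rejected request has no side effects, so the customer can safely resend it with the confirmation set.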

A stronger protection is to refuse filter-based bulk delete entirely, requiring explicit ID lists. This is annoying for some legitimate use cases (delete everything matching a query), but those use cases are rare and the customer can implement them client-side by listing then deleting. The trade-off is between API surface convenience and the blast radius of misuse.

A third option is soft delete with a recovery window. The bulk delete marks records as deleted but keeps them recoverable for some period (30 days is conventional), with a separate hard-delete operation that runs after the window. This is what most consumer-facing systems do because the cost of recovery is much lower than the cost of explaining to a customer that their data is gone forever. For API products serving developers who explicitly want delete to mean delete, the soft delete is sometimes inappropriate; for API products serving end-users via the developer's application, soft delete is often the right default.
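One way to sketch soft delete with a recovery window, using SQLite purely for illustration. The `widgets` table with a nullable `deleted_at` column and the three function names are assumptions, not a description of any particular product's schema.

```python
import sqlite3

RECOVERY_WINDOW_SECONDS = 30 * 24 * 3600  # the conventional 30-day window


def soft_delete(conn, ids, now):
    """Mark rows as deleted without removing them."""
    conn.executemany(
        "UPDATE widgets SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        [(now, i) for i in ids],
    )
    conn.commit()


def restore(conn, ids):
    """Undo a soft delete while the rows still exist."""
    conn.executemany(
        "UPDATE widgets SET deleted_at = NULL WHERE id = ?", [(i,) for i in ids]
    )
    conn.commit()


def purge_expired(conn, now):
    """Hard-delete rows whose recovery window has elapsed; returns rows removed."""
    cur = conn.execute(
        "DELETE FROM widgets WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
        (now - RECOVERY_WINDOW_SECONDS,),
    )
    conn.commit()
    return cur.rowcount
```

Read paths would also need to filter on `deleted_at IS NULL`, which is the main ongoing cost of this approach.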

Audit trail requirements

Every bulk delete operation should produce an audit log entry capturing: who initiated it (account ID, API key ID, IP address), when, what was requested (the IDs or filter), what was deleted (the IDs actually removed), and what failed (the IDs that returned errors and why). The audit log is retained longer than the affected records, which is the point: when a customer asks "what happened to my widgets," the answer needs to exist.

The right place to write the audit log is in the same transaction as the actual delete, using either a same-database append-only audit table or the transactional outbox pattern if the audit log lives in a separate system. Writing the audit log after the delete in a separate transaction means a crash between the two leaves the audit log incomplete.
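The same-database variant can be sketched with SQLite, where using the connection as a context manager commits both writes together or rolls both back. The table layout, column names, and the `bulk_delete_with_audit` function are hypothetical, and the transactional outbox variant is not shown.

```python
import json
import sqlite3


def bulk_delete_with_audit(conn, actor, ids, now):
    """Delete rows and append the audit entry in one transaction."""
    deleted, not_deleted = [], {}
    with conn:  # commits on success, rolls back both writes on any exception
        for widget_id in ids:
            cur = conn.execute("DELETE FROM widgets WHERE id = ?", (widget_id,))
            if cur.rowcount:
                deleted.append(widget_id)
            else:
                not_deleted[widget_id] = "not_found"
        conn.execute(
            "INSERT INTO audit_log (occurred_at, actor, requested, deleted, not_deleted) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                now,
                json.dumps(actor),        # who: account ID, API key ID, IP address
                json.dumps(ids),          # what was requested
                json.dumps(deleted),      # what was actually removed
                json.dumps(not_deleted),  # what was not removed, and why
            ),
        )
    return deleted, not_deleted
```

If the audit insert fails for any reason, the deletes are rolled back too, so there is never a delete without a matching log entry.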

The audit log retention should be longer than the data retention. If user data has 30-day soft-delete and 90-day hard-delete, the audit log should retain delete records for at least a year, ideally seven years for compliance use cases. The volume is small (a few KB per delete operation, regardless of how many records were deleted) and the support value is high.

Rate limiting and quotas

Bulk delete should count against item quotas, not request quotas. A customer who calls bulk delete with 100 IDs is doing 100 deletes worth of work, and the rate limiter should treat it that way. Counting it as one request leaves room for a customer to wipe out their entire account in a few requests, which is exactly the failure mode the rate limit is supposed to prevent.

The implementation is to consume items from the rate limit bucket equal to the number of IDs in the request before processing, returning 429 if the customer does not have capacity. The 429 response should include the standard rate-limit headers plus a hint about how many items the bucket has remaining, so the customer can split the request appropriately.
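A sketch of item-based admission, assuming a token bucket where each ID costs one token. `ItemBucket`, `admit`, and the `X-RateLimit-Remaining-Items` header name are made up for this example; substitute whatever rate-limit headers your API already emits.

```python
import time


class ItemBucket:
    """Token bucket where each item to delete costs one token."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated = now

    def try_consume(self, items):
        """Return (allowed, remaining). Nothing is consumed when denied."""
        self._refill()
        if items > self._tokens:
            return False, int(self._tokens)
        self._tokens -= items
        return True, int(self._tokens)


def admit(bucket, ids):
    """Return None to proceed, or a (status, headers) pair for a 429."""
    allowed, remaining = bucket.try_consume(len(ids))
    if allowed:
        return None
    # Hypothetical header carrying the hint about remaining item capacity.
    return 429, {"X-RateLimit-Remaining-Items": str(remaining)}
```

Denied requests consume nothing, so a customer who splits the batch to fit the reported remainder can retry immediately.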

Synchronous vs asynchronous

Below some threshold (we use 100 items as default), the bulk delete should be synchronous, returning the full response in the same HTTP round-trip. Above the threshold, asynchronous makes sense: return 202 Accepted with a job ID, process the deletes in the background, expose status via a GET on the job ID, optionally fire a webhook on completion.

The cutover threshold depends on how fast deletes are. For most APIs, 100-500 deletes per second per worker is achievable, so a synchronous limit of 100-200 items keeps typical response times near or under a second, with a worst case of about two seconds (200 items at 100 deletes per second). Above that, the customer is better served by an async pattern that gives them progress reporting and avoids HTTP timeouts.

The async pattern requires a bulk_jobs table tracking status, an idempotency mechanism for retries of the job creation, and a worker pool processing claimed jobs with the SKIP LOCKED claim pattern. The complexity is real but the alternative (synchronous bulk operations that occasionally hang for minutes) is worse.
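The dispatch between the two modes can be sketched as follows. The `jobs` dict stands in for the bulk_jobs table, `delete_now` is a hypothetical callable returning per-item results, and the status URL path is invented for the example. Job claiming with SKIP LOCKED is omitted because it needs a real database.

```python
import uuid

SYNC_LIMIT = 100  # cutover threshold; tune to measured delete throughput


def handle_bulk_delete(ids, delete_now, jobs, sync_limit=SYNC_LIMIT):
    """Return (status, body): 200 with results when small, 202 with a job ID when large."""
    if len(ids) <= sync_limit:
        return 200, {"deleted": delete_now(ids)}
    job_id = f"job_{uuid.uuid4().hex}"
    jobs[job_id] = {"status": "queued", "ids": list(ids), "processed": 0}
    # Hypothetical status endpoint the customer polls with GET.
    return 202, {"job_id": job_id, "status_url": f"/api/v1/bulk_jobs/{job_id}"}


def run_job(job, delete_now, chunk_size=50):
    """Worker side: process a claimed job in chunks so progress is observable."""
    job["status"] = "running"
    results = []
    for start in range(0, len(job["ids"]), chunk_size):
        chunk = job["ids"][start:start + chunk_size]
        results.extend(delete_now(chunk))
        job["processed"] += len(chunk)
    job["status"] = "completed"
    job["results"] = results
```

Updating `processed` after each chunk is what lets the status endpoint report progress instead of only queued or done.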

What does not work

Three patterns we have seen and rejected:

Single response code without per-item detail. A bulk delete that returns 200 OK with no body, or with a count but no per-item statuses, makes it impossible for customers to know which items failed. This is the most common bulk delete anti-pattern and produces the most support tickets.

404 on the bulk endpoint if any item is missing. Treating the bulk operation as atomic means any partial failure cancels the whole batch, which is the wrong default. Customers usually want best-effort with detailed reporting, not atomic-or-nothing. Atomic semantics should be an opt-in via a transaction parameter, not the default.

Wildcard filter without confirmation. Allowing DELETE /widgets?status=archived to delete every archived widget without any confirmation is the source of more catastrophic delete incidents than any other API pattern. The mitigation is explicit confirmation, a count threshold, or refusing filter-based bulk delete entirely.

Our use across the four products

DocuMint, CronPing, FlagBit, and WebhookVault all expose bulk delete endpoints for the resources where customers asked for them. CronPing has bulk delete for monitors, FlagBit for flags and rules, WebhookVault for endpoints and captured requests, DocuMint for templates. The implementation pattern is consistent: 200 OK with per-item statuses, per-batch idempotency by (account_id, idempotency_key), count-threshold confirmation above 100 items, full audit trail, items counted against item quotas, synchronous below 200 items and async above.

The deeper observation is that bulk delete is the operation where most APIs reveal whether their designers have thought about partial failure and irreversibility. The design choices that look small (response code, idempotency granularity, audit trail depth) determine whether customers can recover from mistakes, whether support can answer their questions, and whether a single misclick can take down an account. The discipline is treating bulk delete as a category distinct from single-item delete, with its own design requirements derived from the unique combination of multi-item semantics and the impossibility of undo.
