Designing API Soft Delete: Undo Windows, Hard Delete Triggers, and the Customer Recovery Surface

Soft delete is one of the small API design choices that compound. When a customer deletes a resource through your API, you can remove it immediately, mark it deleted and remove it later, or keep it indefinitely behind an explicit recovery surface. The answer shapes the customer-recovery experience, the storage cost, the compliance posture, and the operational complexity of the system for the entire lifetime of the product. Get it right and customer recovery is a routine self-service surface. Get it wrong and you spend years restoring from backups for individual mistakes.

The case for soft delete

The basic argument for soft delete is that customers make mistakes. They click the wrong button, they run the wrong script, they offboard an employee and delete work that turns out to be irretrievable. The cost asymmetry is steep: a recoverable deletion costs the customer very little, while a permanent one can cost them enormously. The vendor's cost of providing recovery is moderate. Asymmetric cost of this kind is the canonical case where the right default favors the customer.

The cases where hard delete is correct are narrow but real. Deletion-for-compliance reasons (GDPR right-to-be-forgotten, PCI-DSS card data retention limits) requires actual removal of the data, not just marking it as deleted. Deletion-of-sensitive-credentials (API keys, webhook signing secrets) needs to be immediate because the leaked-key threat model does not respect retention windows. Some bulk operations on transient data (CronPing ping events older than retention, WebhookVault requests older than retention) are functionally soft-deleted-then-hard-deleted on a schedule, and the customer-facing surface treats them as gone immediately.

The minimum viable schema

The right schema for soft delete is a nullable deleted_at timestamp column, not a boolean. A timestamp tells you both whether the row is deleted and when, which makes the recovery window enforceable and the cleanup job straightforward. A boolean forces you to add a separate timestamp later, which means a migration on a large table.
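
A minimal sketch of that schema, using Python's built-in sqlite3 so it runs anywhere. The `monitors` table, its columns, and the `soft_delete` helper are hypothetical names chosen for illustration, and storing the timestamp as an ISO-8601 UTC string is an assumption; a production database would more likely use a native timestamp type.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical table for illustration; names and column types are assumptions.
SCHEMA = """
CREATE TABLE monitors (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    deleted_at TEXT  -- NULL means live; an ISO-8601 UTC timestamp means soft-deleted
);
"""

def soft_delete(conn, monitor_id, now=None):
    """Stamp deleted_at on a live row. Returns True if a row was marked."""
    now = now or datetime.now(timezone.utc)
    cur = conn.execute(
        # The IS NULL guard keeps a repeated delete from overwriting the
        # original deletion time, which the recovery window depends on.
        "UPDATE monitors SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now.isoformat(), monitor_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the column records when the deletion happened, both the recovery window check and the cleanup cutoff can be derived from it without a second column.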

Every query that returns customer-visible data needs to filter WHERE deleted_at IS NULL. This is the discipline that breaks down most often. The defenses against forgotten filters include making it the default in your ORM or query builder, building integration tests that explicitly call the listing endpoints after a soft delete and assert the deleted row is not returned, and code review checklists for any new query against soft-deletable tables. Row-level security in Postgres is another option that pushes the discipline into the database, at the cost of operational complexity that most small teams do not need.
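
One way to make the filter the default is a single read helper that every customer-visible query goes through. The sketch below is an illustration under assumptions, not a description of any particular ORM: `select_live`, the `SOFT_DELETABLE` allow-list, and the table names are all hypothetical, and the `where` fragment is assumed to come from trusted application code, never from user input.

```python
import sqlite3

# Hypothetical allow-list of soft-deletable tables; names are assumptions.
SOFT_DELETABLE = {"monitors", "invoices"}

def select_live(conn, table, where="1=1", params=()):
    """Single choke point for customer-visible reads.

    Always appends deleted_at IS NULL, so a caller cannot forget it.
    """
    if table not in SOFT_DELETABLE:
        raise ValueError(f"{table} is not a known soft-deletable table")
    # The table name is safe to interpolate because it passed the allow-list;
    # values still go through bound parameters.
    sql = f"SELECT * FROM {table} WHERE ({where}) AND deleted_at IS NULL"
    return conn.execute(sql, params).fetchall()
```

An integration test in the same spirit soft-deletes a row, calls the listing path, and asserts the row is absent, which catches any query that bypasses the helper.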

The recovery surface

Soft delete is only useful if customers can recover. The minimum recovery surface is an endpoint that lists soft-deleted resources within the recovery window and an endpoint that restores them. The list endpoint takes filters on type and deletion date and returns paginated results with the same shape as the regular list endpoint plus a deleted_at field. The restore endpoint takes a resource ID and either restores it (clearing deleted_at) or returns 404 if the recovery window has elapsed.
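
The restore decision can be sketched as a small function that returns an HTTP-style status code. Everything named here is hypothetical (`restore`, the `monitors` table, ISO-8601 string timestamps), and returning 404 for an unknown ID or for a row that is not deleted is an assumption added for the sketch; the text above only specifies 404 once the window has elapsed.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RECOVERY_WINDOW = timedelta(days=30)  # assumed default window for this sketch

def restore(conn, monitor_id, now=None):
    """Return 200 if the row was restored, 404 otherwise."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT deleted_at FROM monitors WHERE id = ?", (monitor_id,)
    ).fetchone()
    if row is None or row[0] is None:
        # Unknown ID, or the row is live and there is nothing to restore
        # (assumption: both cases are reported the same way).
        return 404
    deleted_at = datetime.fromisoformat(row[0])
    if now - deleted_at > RECOVERY_WINDOW:
        # The window has elapsed even if the cleanup job has not run yet.
        return 404
    conn.execute("UPDATE monitors SET deleted_at = NULL WHERE id = ?", (monitor_id,))
    conn.commit()
    return 200
```

Checking the window in the restore path, instead of relying on the cleanup job having already removed the row, keeps the window enforceable even when cleanup runs late.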

The recovery window is one of the load-bearing decisions. Common values are 7 days for routine operational data, 30 days for important data, and 90 days for compliance-driven cases. We use 30 days as the default across our products. A window shorter than 7 days is rarely worth the storage savings; a window longer than 90 days starts to interact with compliance requirements in ways that require legal review.

The dashboard surface complements the API. Customers find the recovery surface through the dashboard when they make a mistake, not through reading the API docs. A "Recently deleted" section in each resource view with a one-click restore button is the right pattern. The corresponding API endpoint matters because tooling and integrations need it, but the dashboard is where the value materializes for most customers.

The hard delete trigger

Soft-deleted rows do not stay around forever. The hard delete trigger is a scheduled job that removes rows whose deleted_at timestamp is older than the recovery window. The implementation is straightforward but has subtleties.

The cleanup needs to be chunked. A single DELETE of all expired rows can hold locks for minutes and block other operations. We chunk in groups of 1000-5000 rows, with a small sleep between chunks, and a per-table rate cap. The cleanup runs nightly at low-traffic hours.
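
A rough sketch of the chunked loop, again on sqlite3 for portability. `purge_expired`, the `monitors` table, and the default parameters are hypothetical; the sketch assumes deleted_at holds UTC ISO-8601 strings so that string comparison matches chronological order, and it leaves out the per-table rate cap and the scheduling.

```python
import sqlite3
import time
from datetime import datetime, timedelta, timezone

def purge_expired(conn, window=timedelta(days=30), chunk_size=1000,
                  pause_s=0.1, now=None):
    """Hard-delete rows soft-deleted before the cutoff, one chunk per transaction.

    Returns the total number of rows removed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - window).isoformat()
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM monitors WHERE id IN ("
            "  SELECT id FROM monitors"
            "  WHERE deleted_at IS NOT NULL AND deleted_at < ?"
            "  LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()  # end the transaction so locks are released between chunks
        total += cur.rowcount
        if cur.rowcount < chunk_size:
            return total  # a short chunk means nothing expired is left
        time.sleep(pause_s)  # give other traffic room before the next chunk
```

Committing after each chunk is what bounds the lock time; the sleep only spaces the chunks out.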

The cleanup needs to be aware of foreign key relationships. Soft-deleting a parent row does not soft-delete the children. The choices are cascading soft delete (which is rarely what customers want because they often want to restore the parent without restoring the children), independent recovery (which is what we do; parent and children have independent deleted_at timestamps), or refusing to soft-delete the parent if children exist. We use independent recovery with a UI that shows the relationship.

The cleanup needs to handle the compliance overlay. When a customer invokes GDPR right-to-be-forgotten, the hard delete needs to happen immediately, not at the next scheduled run. We have a separate force-hard-delete code path that the compliance API triggers.

The audit trail question

Soft delete makes the audit trail richer because the deletion event becomes a recoverable database state, not a gap in history. The audit log should record the deletion with actor, time, and reason (if provided), and the restoration with the same. The audit log itself should be append-only and not subject to soft delete; it is the system of record for what happened.

One pattern that works well is a deletion_metadata JSON column alongside deleted_at, storing the actor and reason and any tombstone information that customers might want to see in the recovery UI ("deleted by Alice on 2026-05-15, reason: 'consolidating monitors'"). The metadata makes the recovery surface more useful and the audit trail richer without requiring schema changes.
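
As an illustration of that pattern, the sketch below writes the metadata at deletion time and renders a tombstone line for the recovery UI. The function names, the JSON keys `actor` and `reason`, and the `monitors` table are assumptions made for the example.

```python
import json
import sqlite3
from datetime import datetime, timezone

def soft_delete_with_metadata(conn, monitor_id, actor, reason=None, now=None):
    """Stamp deleted_at and store who deleted the row and why, as JSON."""
    now = now or datetime.now(timezone.utc)
    metadata = json.dumps({"actor": actor, "reason": reason})
    cur = conn.execute(
        "UPDATE monitors SET deleted_at = ?, deletion_metadata = ?"
        " WHERE id = ? AND deleted_at IS NULL",
        (now.isoformat(), metadata, monitor_id),
    )
    conn.commit()
    return cur.rowcount == 1

def tombstone_label(deleted_at, deletion_metadata):
    """Render the human-readable line shown next to a deleted resource."""
    meta = json.loads(deletion_metadata)
    day = datetime.fromisoformat(deleted_at).date()
    label = f"deleted by {meta['actor']} on {day}"
    if meta.get("reason"):
        label += f", reason: '{meta['reason']}'"
    return label
```

New tombstone fields can be added as extra JSON keys, which is what lets the recovery UI evolve without a migration.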

Three patterns that fail

First, soft delete without filtering. A team adds a deleted_at column and updates the delete endpoint to set it, but forgets to filter on it everywhere else. Deleted rows show up in list endpoints, in counts, in aggregations. The bug is invisible until a customer complains, and the cost of finding all the places that need the filter is high. The mitigation is to add the filter in a single shared query helper from the start.

Second, soft delete with too-aggressive cleanup. A team sets the recovery window at 24 hours because the storage cost looks scary, and a customer who deletes something on Friday and notices on Monday finds it gone. The recovery window needs to match customer mental models, which are typically days to weeks, not hours.

Third, soft delete on resources that customers think of as transient. Some resources (audit log entries, webhook delivery attempts, scheduled job runs) are themselves the audit trail of other actions. Soft-deleting them produces a confusing recovery surface and does not match customer mental models. These should be hard-deleted on schedule or retained indefinitely depending on the compliance posture, but not soft-deleted with a recovery window.

Our use across the four products

DocuMint soft-deletes invoices and templates with 30-day recovery. The actual PDF files in object storage are also retained for the window. Hard delete includes both database row and storage object.

CronPing soft-deletes monitors with 30-day recovery, but the ping history associated with a deleted monitor is retained for the full ping-retention window (90 days on paid plans) so that restoration brings back the historical chart. The recovery surface lets customers see deleted monitors with their last-known status.

FlagBit soft-deletes flags and projects with 30-day recovery, with the caveat that a soft-deleted flag still evaluates as its previous default value (not "flag does not exist") for that window. This is deliberate: customers who accidentally delete a flag in production should not have evaluations start failing or returning unexpected defaults. Once the hard delete runs after 30 days, evaluations return the "flag does not exist" behavior.

WebhookVault soft-deletes endpoints with 30-day recovery, but the captured webhook events bound to a deleted endpoint are retained for the standard event-retention window. Deletion of an endpoint stops new captures but preserves history.

The deeper observation

Soft delete is one of those small design choices that age well or age badly depending on how thoroughly the implementation respects the asymmetric cost of recovery. The vendor's cost of supporting soft delete (storage, query filtering, recovery UI, cleanup jobs) is paid continuously and is moderate. The customer's cost of not having soft delete materializes rarely but is severe when it does. The expected-value math overwhelmingly favors having the recovery surface, and the products that get it right find their support burden drops markedly because customer mistakes become self-service problems rather than support escalations.


This essay is part of our ongoing series on practical API design. Our products DocuMint, CronPing, FlagBit, and WebhookVault all use the same 30-day soft-delete window with self-service recovery dashboards.
