Designing API Cancellation: How to Cancel Long-Running Operations Without Race Conditions

Most APIs handle starting work and finishing work well. The middle case, telling an operation to stop, is where the design tradeoffs get interesting, especially when part of the work has already been committed.

Cancellation looks easy from the outside: the customer hits cancel, the operation stops, the API returns success. From the inside it is one of the harder API design problems, especially when the operation has already produced partial work that has to be either kept or rolled back. The interesting decisions are about when cancellation is allowed, what state the resource is in afterward, and how the system handles cancel requests that race with completion.

We hit cancellation across all four products: DocuMint for batch PDF generation, CronPing for monitor backfill jobs, FlagBit for bulk flag updates, and WebhookVault for replay sweeps. Each forced a different decision about cancellation semantics.

The three cancellation models

Three distinct models cover almost all production API cancellation:

  • Best-effort cancellation: the cancel request signals the worker, which checks the signal at convenient points and exits cleanly when it sees one. The operation may complete fully if the cancel arrives after the last checkpoint.
  • Hard cancellation: the cancel request kills the worker immediately, and any uncommitted work is discarded. Already-committed side effects remain.
  • Compensating cancellation: the cancel request triggers a saga that undoes each completed step, most recent first. The end state is approximately the pre-operation state.

Best-effort is the easiest to implement and the most predictable to operate. Hard cancellation is risky because it can leave the system in an inconsistent state if the worker was mid-write. Compensating cancellation is the most expensive and the closest to "cancel means undo" semantics that customers usually want.
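
To make the compensating model concrete, here is a minimal sketch, assuming each step can be expressed as a pair of do/undo callables. The names run_with_compensation and should_cancel are hypothetical, and a real saga would also have to cope with an undo step that itself fails.

```python
from typing import Callable, List, Tuple

# Hypothetical shape: each step is a (do, undo) pair of zero-argument callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step], should_cancel: Callable[[], bool]) -> str:
    """Run steps in order; on cancel, undo the completed ones, most recent first."""
    undo_stack: List[Callable[[], None]] = []
    for do, undo in steps:
        if should_cancel():
            # Unwind in reverse order so later steps are reversed before earlier ones.
            for compensate in reversed(undo_stack):
                compensate()
            return "cancelled"
        do()
        undo_stack.append(undo)
    return "completed"
```

The cost shows up in the second element of every pair: each step needs a working inverse, which is why this model is the most expensive of the three.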

Most APIs should use best-effort. The key design decision is how often the worker checks the cancel signal. Too frequently and you waste CPU; too rarely and the cancel feels unresponsive. The sweet spot is usually at natural checkpoints: between batches, after each side effect, before each external call.
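
As an illustration of checkpoint placement, the sketch below checks the signal once per batch. It assumes the cancel signal is a threading.Event shared with the worker; run_batches and work are hypothetical names, and in a multi-process system the signal would more likely live on the operation record.

```python
import threading
from typing import Callable, Dict, List


def run_batches(
    items: List[int],
    batch_size: int,
    cancel: threading.Event,
    work: Callable[[int], int],
) -> Dict[str, object]:
    """Process items batch by batch, checking the cancel signal between batches."""
    results: List[int] = []
    for start in range(0, len(items), batch_size):
        # Checkpoint: the only place the worker looks at the cancel signal.
        if cancel.is_set():
            return {"status": "cancelled", "results": results}
        for item in items[start:start + batch_size]:
            results.append(work(item))
    # A cancel that arrives after the last checkpoint is never observed,
    # so the operation reports completed.
    return {"status": "completed", "results": results}
```

A batch that has already started always finishes here, so batch size is the knob that trades responsiveness against checking overhead.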

The race condition

The classic cancellation race is: customer hits cancel at time T, worker completes the operation at time T+epsilon. The cancel arrives at the API server, is forwarded to the worker, and finds nothing to cancel. What status does the API return?

The wrong answer is to return 404 because the operation is no longer in-progress. The customer sees an error and assumes their cancel did not work, which leads them to retry or check status, both of which are unnecessary. The right answer is to return 200 with the current operation status: completed, with a note that it finished before cancel could take effect. The customer's mental model ("I asked you to stop, you told me what happened") is preserved.

The implementation requires the cancel handler to check the operation's current status and act on it atomically. If the operation is still running, move it to cancel_requested and return that status. If it has already completed, return the completed status. If it is already cancelled, return the cancelled status idempotently. This is a single transactional read-and-conditional-write on the operation record.
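
One way to get that read-and-conditional-write is a conditional UPDATE inside a transaction. The sketch below uses Python's built-in sqlite3 module and assumes a hypothetical operations table with id and status columns. The request_cancel name, the cancel_accepted field, and the 404 for an unknown id are illustrative choices, and the pending case is left out for brevity.

```python
import sqlite3
from typing import Dict, Tuple


def request_cancel(conn: sqlite3.Connection, op_id: str) -> Tuple[int, Dict[str, object]]:
    """Return (http_status, body) for a cancel request on op_id."""
    with conn:  # one transaction: commits on success, rolls back on error
        # The WHERE clause makes the write conditional, so a worker that has
        # already marked the operation completed is never overwritten.
        updated = conn.execute(
            "UPDATE operations SET status = 'cancel_requested' "
            "WHERE id = ? AND status = 'running'",
            (op_id,),
        ).rowcount
        row = conn.execute(
            "SELECT status FROM operations WHERE id = ?", (op_id,)
        ).fetchone()
    if row is None:
        # Assumption: only an id that never existed is an error.
        return 404, {"error": "unknown operation"}
    # Every known operation gets 200 with its current status, whoever won the race.
    return 200, {"id": op_id, "status": row[0], "cancel_accepted": updated == 1}
```

The worker's own write to completed needs the same treatment: it should also be conditional on the status it expects to find.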

The state machine

An operation that supports cancellation needs at least these states:

  • pending: queued but not started. Cancel here is cheap: remove the queue entry and mark the operation cancelled.
  • running: started, not done. Cancel here moves the operation to cancel_requested; the worker picks that up at its next checkpoint.
  • cancel_requested: cancel signal received but worker has not exited yet. Used to distinguish from running for status queries.
  • cancelled: worker exited cleanly after seeing cancel signal. Terminal state.
  • completed: worker finished normally. Terminal state.
  • failed: worker exited with error. Terminal state.

The terminal states are important: they signal that further cancel requests on this operation are no-ops and the API can return the current status idempotently. Without explicit terminal states, you end up with corner cases where a long-completed operation accepts a cancel request and the customer is confused about what happened.
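
The states above can be written down as an explicit transition table. This is a sketch under one assumption not stated in the list: a cancel_requested operation may still end up completed or failed, which follows from best-effort cancellation. The names State, TRANSITIONS, is_terminal, and transition are hypothetical.

```python
from enum import Enum


class State(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    CANCEL_REQUESTED = "cancel_requested"
    CANCELLED = "cancelled"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed moves. Terminal states have no outgoing transitions.
TRANSITIONS = {
    State.PENDING: {State.RUNNING, State.CANCELLED},
    State.RUNNING: {State.CANCEL_REQUESTED, State.COMPLETED, State.FAILED},
    # Assumption: the worker may still finish or fail before it sees the cancel.
    State.CANCEL_REQUESTED: {State.CANCELLED, State.COMPLETED, State.FAILED},
    State.CANCELLED: set(),
    State.COMPLETED: set(),
    State.FAILED: set(),
}


def is_terminal(state: State) -> bool:
    return not TRANSITIONS[state]


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

A cancel handler can call is_terminal first and return the current status unchanged when it is true, which is the idempotent no-op described above.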

The partial-result problem

What happens to work the operation completed before cancel? Three reasonable answers exist, depending on the operation:

  • Keep partial: 47 of 100 PDFs generated; the customer gets the 47. Used for batch operations where partial results have value.
  • Discard partial: 47 PDFs generated and stored, then deleted on cancel. Used for transactions where partial state is meaningless or harmful.
  • Compensate partial: 47 PDFs generated, then explicitly unwound (deleted from storage, refunded billing). Used for operations with external side effects.

The operation type usually dictates which is right. The customer-visible behavior should be documented explicitly, because cancel-with-partial-results is a frequent source of customer confusion. Our default is keep-partial for batch operations and discard-partial for transactional operations, with explicit per-endpoint documentation when the default is overridden.
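
A per-endpoint policy setting is one way to make the choice explicit in code. The sketch below is illustrative only: PartialPolicy and settle_partial_results are hypothetical names, and delete and refund stand in for whatever storage and billing calls the operation actually uses.

```python
from enum import Enum
from typing import Callable, List


class PartialPolicy(Enum):
    KEEP = "keep"
    DISCARD = "discard"
    COMPENSATE = "compensate"


def settle_partial_results(
    policy: PartialPolicy,
    artifacts: List[str],
    delete: Callable[[str], None],
    refund: Callable[[str], None],
) -> List[str]:
    """Apply the policy after a cancel and return the artifacts the customer keeps."""
    if policy is PartialPolicy.KEEP:
        return list(artifacts)
    # DISCARD and COMPENSATE both remove stored output, most recent first.
    for artifact in reversed(artifacts):
        delete(artifact)
        if policy is PartialPolicy.COMPENSATE:
            # COMPENSATE also reverses the external side effect (billing here).
            refund(artifact)
    return []
```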

The cancellation API surface

The minimum useful API for cancellation is two endpoints: a POST to request cancellation, and a GET to check status. The POST should be idempotent: repeated cancel requests on the same operation never fail or trigger extra side effects, and each one returns the operation's current status. The GET should include enough state for the customer to reason about what happened (status, started_at, completed_at, cancelled_at, partial_results_summary).
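
For a sense of what the status payload might look like, here is a small sketch that serializes the fields listed above. The id field, the ISO 8601 timestamp strings, and the shape of partial_results_summary are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class OperationStatus:
    id: str
    status: str
    started_at: Optional[str]    # ISO 8601 timestamp, or None if never started
    completed_at: Optional[str]
    cancelled_at: Optional[str]
    # Assumed shape: simple counters, e.g. {"done": 47, "total": 100}.
    partial_results_summary: Dict[str, int]


def to_response_body(op: OperationStatus) -> str:
    """Serialize the status record as the JSON body for both the GET and the POST."""
    return json.dumps(asdict(op), sort_keys=True)
```

Returning the same body from the cancel POST and the status GET means the customer only has to parse one shape, whichever call they make.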

The bulk-cancel pattern (cancel all in-progress operations matching a filter) is occasionally needed for incident response, but it should be rate-limited and audit-logged. Customers rarely need it; operators occasionally need it; the surface area for misuse is large.

The deeper observation

Cancellation is one of those API features whose absence is invisible until customers need it, at which point its absence becomes the most visible thing about the API. The right approach is to design cancellation in from the start for any operation that takes longer than a few seconds, even if the initial implementation is best-effort. Adding cancellation to a long-running endpoint after launch usually requires schema changes, worker changes, and customer communication; designing it in costs almost nothing.

The broader lesson is that asynchronous APIs are not just "synchronous APIs that take longer." They have a fundamentally different lifecycle: start, observe, intervene, complete. The intervene step (cancellation, pause, resume, reprioritize) is what distinguishes a thoughtful async API from one that just took the easy way out of a synchronous design.
