Designing API Background Job Status Endpoints: How Customers Build Reliable Polling

Long-running operations return a job ID instead of a result. Customers poll the status endpoint until the job is complete. The shape of that endpoint determines how much customer code is needed to handle the workflow correctly.

Most non-trivial APIs eventually have at least one operation that does not finish within a reasonable HTTP request timeout. Bulk imports, file generation, data exports, multi-step provisioning, training operations, and reindexing all fall into this category. The standard pattern is well-established: the initial request returns 202 Accepted with a job identifier, the customer polls a status endpoint with that identifier, and eventually the status indicates completion with a pointer to results.

The shape of the status endpoint is where the work happens. A poorly designed status endpoint forces customer code to handle ambiguous responses, race conditions between the polling client and the worker, and retry behavior that does not match the customer's actual error model. A well-designed status endpoint surfaces enough information that customers can write straightforward polling loops without defensive case analysis.

The minimum useful response shape

The minimum information a status endpoint needs to surface is: the current state of the job (pending, running, completed, failed, cancelled), the time the state changed, and a way to find the result if the job is complete. Everything else is optional but often valuable.

A typical response shape looks like this:

{
  "id": "job_01H7X3...",
  "object": "job",
  "type": "bulk_invoice_generation",
  "status": "running",
  "created_at": "2026-06-03T12:00:00Z",
  "started_at": "2026-06-03T12:00:01Z",
  "completed_at": null,
  "progress": {
    "current": 234,
    "total": 1000,
    "rate_per_second": 12.4
  },
  "result_url": null,
  "error": null
}

The id and status are essential. The created_at and completed_at let the customer compute durations without server-side support. The progress block is the most underrated field: customers can show progress bars and estimate completion times. The result_url is null while the job is running and points at a separate resource when complete.
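As a rough illustration of what the timestamps and the progress block make possible on the client side, the following Python sketch computes elapsed time and a naive remaining-time estimate from a response like the one above. The helper names are hypothetical, and the estimate assumes the reported rate stays constant.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_seconds(job: dict, now: datetime) -> float:
    """Seconds since creation, measured up to completion if the job finished."""
    end = parse_ts(job["completed_at"]) or now
    return (end - parse_ts(job["created_at"])).total_seconds()


def eta_seconds(job: dict) -> Optional[float]:
    """Naive remaining-time estimate; None when progress data is missing."""
    progress = job.get("progress")
    if not progress or not progress.get("rate_per_second"):
        return None
    remaining = progress["total"] - progress["current"]
    return max(remaining, 0) / progress["rate_per_second"]


# A trimmed copy of the sample response, used only for illustration.
sample = {
    "status": "running",
    "created_at": "2026-06-03T12:00:00Z",
    "completed_at": None,
    "progress": {"current": 234, "total": 1000, "rate_per_second": 12.4},
}
```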

The five-state machine

A robust job status field uses a small enumerated set of states with explicit transitions. The standard set is pending (created but no worker has started), running (a worker is processing), completed (work is done, result is available), failed (work attempted, did not succeed, error info attached), and cancelled (customer asked for cancellation and it took effect).

Customers will write code that switches on the status. Adding new states later is a breaking change in practice, because customer code that explicitly handles known states often falls through to a default that does the wrong thing for unknown states. The five-state model is large enough to cover most workflows and small enough that adding a sixth state is rarely tempting.
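On the customer side, one defensive option is to make the unknown-state case explicit instead of letting it fall through. The sketch below is a minimal Python example under that assumption; the action names and the exception class are hypothetical.

```python
class UnknownJobStatus(Exception):
    """Raised when the server reports a status this client does not know."""


def next_action(status: str) -> str:
    """Map a job status to what the polling client should do next."""
    if status in ("pending", "running"):
        return "poll_again"
    if status == "completed":
        return "fetch_result"
    if status in ("failed", "cancelled"):
        return "stop"
    # Fail loudly rather than silently treating a new state as a known one.
    raise UnknownJobStatus(status)
```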

Two state transitions are subtle. The first is pending to cancelled: a customer can cancel a job that has not started yet, and the cancellation succeeds without any worker involvement. The second is running to cancelled: cancellation while a worker is active typically does not stop the worker immediately, so the status transition lags the cancellation request. Documenting the lag explicitly prevents customer confusion.
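The two cancellation paths can be sketched server-side as a small transition table plus a cancellation flag. This is an illustrative Python sketch, not a prescribed implementation: the exact edge set is an assumption, and cancel_requested is a hypothetical internal field that the worker checks between units of work.

```python
# Allowed transitions for the five-state machine. The exact edge set is an
# assumption; a real service might add edges such as pending -> failed.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def transition(job: dict, new_status: str) -> None:
    """Move the job to new_status, rejecting edges the table does not allow."""
    if new_status not in ALLOWED_TRANSITIONS[job["status"]]:
        raise ValueError(f"illegal transition {job['status']} -> {new_status}")
    job["status"] = new_status


def request_cancel(job: dict) -> None:
    """Handle a cancellation request from the customer."""
    if job["status"] == "pending":
        # No worker is involved, so the cancellation takes effect immediately.
        transition(job, "cancelled")
    elif job["status"] == "running":
        # The worker sees this hypothetical flag at its next checkpoint;
        # until then the status endpoint keeps reporting "running".
        job["cancel_requested"] = True
    # Jobs already in a terminal state are left unchanged.


def worker_checkpoint(job: dict) -> bool:
    """Called by the worker between units of work. Returns True to stop."""
    if job.get("cancel_requested"):
        transition(job, "cancelled")
        return True
    return False
```

The gap between request_cancel and the next worker_checkpoint is the lag that the documentation should describe.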

Polling cadence and Retry-After

Naive polling loops hammer the status endpoint at fixed intervals, often as fast as the customer code can issue requests. This is wasteful for both sides. The server bears the cost of many no-op status reads; the customer wastes time and bandwidth.

The right pattern is to use the Retry-After header to signal the appropriate polling interval. While the job is in a stable state (pending, running with stable progress), Retry-After can be long (5-30 seconds). When the job is about to finish based on progress data, Retry-After can be short (1-2 seconds). Customers who honor Retry-After get reasonable polling behavior with no client-side estimation logic.

The customer-visible semantics should be simple: Retry-After indicates the minimum time the customer should wait before the next status request. Customers can wait longer if they prefer; they should not poll sooner. The server can use Retry-After to throttle aggressive pollers without returning 429 errors.
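A client loop that honors Retry-After can be quite small. The Python sketch below assumes a hypothetical fetch callable that returns a (headers, body) pair for one status request, so the HTTP client itself is left out. It handles only the delay-in-seconds form of Retry-After and falls back to a default otherwise; the default, ceiling, and poll limit are arbitrary illustrative values.

```python
import time
from typing import Callable, Mapping, Tuple

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def retry_after_seconds(headers: Mapping[str, str],
                        default: float = 5.0,
                        ceiling: float = 60.0) -> float:
    """Read Retry-After as whole seconds; use the default if absent or unparsable."""
    try:
        seconds = int(headers.get("Retry-After"))
    except (TypeError, ValueError):
        return default
    return float(min(max(seconds, 0), ceiling))


def poll_job(fetch: Callable[[], Tuple[Mapping[str, str], dict]],
             sleep: Callable[[float], None] = time.sleep,
             max_polls: int = 1000) -> dict:
    """Poll until the job reaches a terminal state, waiting as the server asks."""
    for _ in range(max_polls):
        headers, job = fetch()
        if job["status"] in TERMINAL_STATES:
            return job
        sleep(retry_after_seconds(headers))
    raise TimeoutError("job did not reach a terminal state")
```

Passing sleep as a parameter keeps the loop testable without real delays. Note that a plain dict lookup is case-sensitive, whereas most HTTP client libraries expose case-insensitive header mappings.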

Idempotency and version semantics

Status endpoints are idempotent by definition: a GET on /jobs/{id} returns the current state, regardless of how many times it has been called. The non-obvious question is what happens when the status changes between two polls. Customers reading the response from one poll and the response from the next poll need to be able to detect that progress has been made.

The simplest mechanism is monotonically increasing fields. An updated_at timestamp (not shown in the sample response above) moves forward each time the job state changes. The progress.current counter moves forward as work is done. Customers can compare timestamps and counters across responses to detect movement.
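Comparing two consecutive responses might look like the following Python sketch. It assumes both responses carry an updated_at timestamp in ISO 8601 form and, where meaningful, a progress block; the function name is hypothetical.

```python
from datetime import datetime


def _ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def has_moved(previous: dict, current: dict) -> bool:
    """True if the job changed state or made progress between two polls."""
    if _ts(current["updated_at"]) > _ts(previous["updated_at"]):
        return True
    prev_done = (previous.get("progress") or {}).get("current", 0)
    curr_done = (current.get("progress") or {}).get("current", 0)
    return curr_done > prev_done
```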

A more sophisticated mechanism is an ETag on the status response, allowing customers to use If-None-Match for cheap polling. The server returns 304 Not Modified when the job state has not changed since the customer's last poll. This eliminates the response body cost for no-op polls. The cost is server-side ETag computation, which is typically a hash of the state-affecting fields.
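One way to derive such an ETag is to hash a canonical serialization of the state-affecting fields. The Python sketch below is an assumption-laden illustration: the field list, the truncated SHA-256 digest, and the function names are all choices made for the example, and the comparison handles only a single exact ETag value, not lists or weak validators.

```python
import hashlib
import json
from typing import Optional, Tuple

# Fields assumed to affect the customer-visible state of the job.
STATE_FIELDS = ("status", "started_at", "completed_at",
                "progress", "result_url", "error")


def compute_etag(job: dict) -> str:
    """Hash the state-affecting fields into a quoted ETag value."""
    canonical = json.dumps({k: job.get(k) for k in STATE_FIELDS},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f'"{digest}"'


def conditional_status(job: dict,
                       if_none_match: Optional[str]) -> Tuple[int, dict, Optional[dict]]:
    """Return (status code, headers, body) for a GET on the status endpoint."""
    etag = compute_etag(job)
    if if_none_match == etag:
        return 304, {"ETag": etag}, None  # unchanged: no body needed
    return 200, {"ETag": etag}, job
```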

Three patterns that fail

The first is conflating job not found with job not done. A customer who polls a job ID and gets a 404 cannot distinguish between a job that was deleted, a job whose ID was malformed, and a job that has not yet been persisted. The right pattern is to persist the job record before the initial 202 is returned, keep it for a documented retention period (often 24-72 hours past completion), and return 404 only for genuinely unknown IDs.

The second is opaque error fields. A failed job returns status "failed" and an error field with a string message. If the message is a generic "Job failed" without structured information, the customer has no way to handle different error categories programmatically. The error field should contain at least a stable error code, a human-readable message, and a request_id for support correlation. Optional fields can include retry suggestions, documentation URLs, and partial results.
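A structured error object lets customer code branch on the code rather than parse the message. The following Python sketch shows one possible shape; the error codes, the docs_url field name, and the helper names are hypothetical and stand in for whatever stable set an API documents.

```python
from typing import Optional

# Hypothetical error codes; a real API would document its own stable set.
RETRYABLE_CODES = {"dependency_unavailable", "worker_lost"}


def make_error(code: str, message: str, request_id: str,
               docs_url: Optional[str] = None) -> dict:
    """Build the structured error object attached to a failed job."""
    error = {"code": code, "message": message, "request_id": request_id}
    if docs_url is not None:
        error["docs_url"] = docs_url
    return error


def should_resubmit(job: dict) -> bool:
    """Client-side decision driven by the stable code, not the message text."""
    if job["status"] != "failed" or not job.get("error"):
        return False
    return job["error"]["code"] in RETRYABLE_CODES
```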

The third is silently extending job execution time. Jobs that are stuck (worker died, queue starved, dependency unavailable) sometimes show "running" status indefinitely. The right pattern is to enforce a server-side maximum job duration, mark jobs that exceed it as failed with a stuck timeout error, and clean up the worker side. Customers polling a stuck job at least get an actionable failure instead of polling forever.
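A periodic sweep is one way to enforce the maximum duration. The Python sketch below operates on in-memory job dictionaries for illustration; the one-hour limit, the stuck_timeout code, and the function name are assumptions, and worker-side cleanup is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import List

MAX_JOB_DURATION = timedelta(hours=1)  # assumed limit; tune per job type


def fail_stuck_jobs(jobs: List[dict], now: datetime) -> List[str]:
    """Mark running jobs that exceeded the limit as failed; return their IDs."""
    failed_ids = []
    for job in jobs:
        if job["status"] != "running":
            continue
        started = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
        if now - started > MAX_JOB_DURATION:
            job["status"] = "failed"
            job["completed_at"] = now.isoformat().replace("+00:00", "Z")
            job["error"] = {
                "code": "stuck_timeout",  # hypothetical stable error code
                "message": "Job exceeded the maximum allowed duration.",
            }
            failed_ids.append(job["id"])
    return failed_ids
```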

What status endpoints should not do

The first thing they should not do is return the result inline. The result of a job is often large (multi-megabyte exports, generated files, large data sets) and the status response should be small. The result_url field points at a separate endpoint that returns the actual result, often with content negotiation or download semantics. Inlining the result mixes the status workflow with the result workflow and produces large status responses for completed jobs that customers may poll multiple times.

The second is allowing mutations through the status endpoint. The status endpoint is a GET; cancellation should be a separate write request, typically POST /jobs/{id}/cancel or DELETE /jobs/{id}. Mixing read and write behavior in the same request produces ambiguous semantics for customers and complicates caching.

The third is exposing internal implementation details. The status field should be the small enumerated set; internal worker state (queue position, current SQL statement, current file being processed) should not leak into the response. Customers will write code that depends on whatever fields the response contains. If the worker implementation changes, the status response should not change in customer-visible ways.
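A simple guard against leakage is to serialize the response from an explicit allowlist instead of dumping the internal record. This Python sketch assumes the public fields from the sample response shown earlier; the internal field names in the usage example are hypothetical.

```python
# Fields that form the customer-visible contract; everything else stays internal.
PUBLIC_FIELDS = ("id", "object", "type", "status", "created_at", "started_at",
                 "completed_at", "progress", "result_url", "error")


def to_public(record: dict) -> dict:
    """Project an internal job record onto the public response shape."""
    return {field: record.get(field) for field in PUBLIC_FIELDS}


# Usage: internal bookkeeping never reaches the response body.
internal_record = {
    "id": "job_example",
    "object": "job",
    "type": "bulk_invoice_generation",
    "status": "running",
    "created_at": "2026-06-03T12:00:00Z",
    "started_at": "2026-06-03T12:00:01Z",
    "queue_position": 3,        # hypothetical internal field
    "worker_host": "worker-7",  # hypothetical internal field
}
public_response = to_public(internal_record)
```

Because the allowlist is the only path to the response, a change in worker internals cannot alter the customer-visible shape by accident.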

What we do across our background jobs

Background jobs show up in the Builds app (bulk listing imports are planned for later phases) and in the future expansion candidates the Anethoth studio is building. The shared pattern is the five-state machine, the progress block when meaningful, Retry-After-driven polling, and stable error codes in the failure case. The cost of building this consistently is low compared to the customer support burden of inconsistent job APIs.

The deeper observation is that the status endpoint is the only customer-visible artifact of the asynchronous workflow. The worker can be elegant or ugly, parallel or serial, in-process or external; none of that matters to the customer. What matters is whether the customer's polling loop is straightforward to write and reliable in production. The status endpoint is the contract that determines whether the asynchronous design is a feature or a tax.

