Designing Long-Running Job Status APIs: Polling, Webhooks, and Server-Sent Events

Some operations take longer than a request-response cycle can hold. A status API has to tell the customer what is happening without keeping the connection open. The three patterns are polling, webhooks, and SSE, and they fit different customer profiles.

Most API operations finish in milliseconds. A small minority take longer than any reasonable request-response cycle can hold: large PDF batches, multi-region webhook replays, flag rollout simulations across millions of users, archive exports that scan years of data. These operations cannot return their result in the response to the triggering request. They need a separate status surface that customers can query.

The natural temptation is to return a job ID and tell the customer to poll for results. This works but has costs that compound. Customers who poll too often produce unnecessary load. Customers who poll too rarely wait longer than they need to. Customers who forget to handle the status transitions correctly miss completion events. The right design starts from the customer side: what does the integration code want to look like when this operation takes 30 seconds vs 30 minutes vs 30 hours?

The shape of a job status API

The minimum surface is a POST that creates the job and a GET that returns the status. The POST returns 202 Accepted with a Location header pointing to the status endpoint and a body containing the job ID. The status endpoint returns the current state plus, when the job completes, either the result inline or a URL to fetch the result from.
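
As a rough illustration of that surface, the sketch below builds the two responses as plain data rather than through any particular web framework. The base URL, the /jobs/{id} path, and the field names id, state, and result_url are assumptions made for the example, not a prescribed contract.

```python
import uuid
from typing import Dict, Optional, Tuple

def create_job_response(base_url: str = "https://api.example.com") -> Tuple[int, Dict[str, str], Dict[str, str]]:
    """Build (status_code, headers, body) for a newly accepted job."""
    job_id = uuid.uuid4().hex
    # Location points at the status endpoint the customer will query later.
    headers = {"Location": f"{base_url}/jobs/{job_id}"}
    body = {"id": job_id, "state": "pending"}
    return 202, headers, body

def status_body(job_id: str, state: str, result_url: Optional[str] = None) -> Dict[str, str]:
    """Build the status payload; the result link appears only on completion."""
    body = {"id": job_id, "state": state}
    if state == "completed" and result_url is not None:
        body["result_url"] = result_url
    return body
```

The POST handler would return the first tuple, and the GET handler would return status_body with whatever the job store currently holds.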

The state values matter for client logic. A small set is enough: pending (created but not started), running (in progress), completed (succeeded), failed (terminal error), cancelled (user-cancelled). Each terminal state must be sticky: once the job is completed, the status endpoint must return completed for as long as the job is retained, regardless of subsequent events.
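
One way to enforce stickiness is to route every state change through a single function that refuses to leave a terminal state. The sketch below uses the five states listed above; the table of allowed transitions is an assumption for illustration, and a real system may permit others (for example pending directly to failed).

```python
from enum import Enum

class JobState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

TERMINAL = frozenset({JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED})

# Assumed transition table; adjust to the job system's actual rules.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
}

def transition(current: JobState, requested: JobState) -> JobState:
    """Return the state after applying a requested change."""
    if current in TERMINAL:
        # Sticky: late or duplicate events never move a finished job.
        return current
    if requested not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {requested.value}")
    return requested
```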

Progress information is helpful but not required. The clean approach is a progress field expressed as a fraction or as {done, total}. The honest approach is to omit progress when the total is genuinely unknown rather than fake it with arbitrary milestones. Customers find a missing progress bar acceptable. They find a progress bar that hangs at 99 percent infuriating.
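
A minimal sketch of the "omit rather than fake" rule, assuming the {done, total} form and a hypothetical progress key in the status body:

```python
from typing import Dict, Optional

def progress_field(done: int, total: Optional[int]) -> Dict[str, Dict[str, int]]:
    """Return a fragment to merge into the status body.

    The fragment is empty when the total is unknown, so the client sees
    no progress field at all instead of an invented one.
    """
    if total is None or total <= 0:
        return {}
    # Clamp so a miscounted worker can never report more than 100 percent.
    done = max(0, min(done, total))
    return {"progress": {"done": done, "total": total}}
```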

Retention of completed jobs is a design choice with consequences. Twenty-four hours is a reasonable default: long enough for customers to retrieve results the next day, short enough that the table does not grow unboundedly. Customers who trigger jobs just before a weekend and collect results on Monday need a longer window, so the value is worth making configurable. The status endpoint should return 404 after retention expires, not 410 Gone. The customer-side effect is the same and 404 is more universally handled.
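
The retention check itself is small. The sketch below assumes a 24-hour window and a completed_at timestamp recorded when the job reached a terminal state; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(hours=24)  # assumed default; tune per product

def status_http_code(completed_at: Optional[datetime], now: datetime) -> int:
    """Return the HTTP status the status endpoint should answer with."""
    if completed_at is None:
        # Job has not finished, so retention has not started counting.
        return 200
    # After the window, the job is treated as if it never existed.
    return 404 if now - completed_at > RETENTION else 200
```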

The polling pattern

Polling is the default that always works. The customer calls the status endpoint on a schedule until the state becomes terminal. The polling interval is the design lever: too short and the customer wastes their own and your CPU, too long and the customer experiences avoidable latency.

The right approach is server-controlled. The status response should include a Retry-After header with the number of seconds the customer should wait before the next poll. The value can vary: 1 second while the job is starting, 5 seconds while running, longer when the job is known to take hours. The server has visibility into the expected job duration that the client does not, and the header lets the server share that information.

The customer-side discipline is to honor the header. SDKs should poll at the rate the server specifies and back off exponentially when the server is unavailable. Customers who poll faster than Retry-After suggests should eventually be rate-limited, but the friendly first response is to log the early polls and serve them anyway rather than reject them. The rate limit exists to keep pathological customer code from overwhelming the status endpoint.
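
A client loop that follows this discipline might look like the sketch below. It is transport-agnostic: fetch_status is a hypothetical callable supplied by the caller that returns (headers, body) and raises ConnectionError when the server is unreachable. Only the delta-seconds form of Retry-After is parsed here; anything else falls back to a default wait.

```python
import time
from typing import Callable, Dict, Mapping, Tuple

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def parse_retry_after(headers: Mapping[str, str], default_wait: float, max_wait: float) -> float:
    """Read Retry-After as seconds, falling back to default_wait."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        seconds = float(lowered.get("retry-after"))
    except (TypeError, ValueError):
        return default_wait  # header missing or not a plain number
    return min(max(seconds, 0.0), max_wait)

def poll_until_terminal(
    fetch_status: Callable[[], Tuple[Mapping[str, str], Dict]],
    sleep: Callable[[float], None] = time.sleep,
    default_wait: float = 5.0,
    max_wait: float = 300.0,
    max_polls: int = 1000,
) -> Dict:
    """Poll at the server-specified rate until the job reaches a terminal state."""
    backoff = 1.0
    for _ in range(max_polls):
        try:
            headers, body = fetch_status()
        except ConnectionError:
            # Server unavailable: back off exponentially, capped at max_wait.
            sleep(backoff)
            backoff = min(backoff * 2, max_wait)
            continue
        backoff = 1.0
        if body["state"] in TERMINAL_STATES:
            return body
        sleep(parse_retry_after(headers, default_wait, max_wait))
    raise TimeoutError("job did not reach a terminal state")
```

Injecting sleep keeps the loop testable without real delays; production code would pass nothing and get time.sleep.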

The polling pattern fits jobs that complete within minutes and customers who are actively waiting for the result. For longer jobs or customers who want to walk away, polling is inefficient.

The webhook pattern

For longer-running jobs and customers who do not want to keep a client connected, webhooks deliver the completion event to a customer-supplied URL. The customer registers a webhook URL (per-account or per-job), the job runs, and on terminal state the system POSTs an event to the registered URL.

The webhook payload should include the job ID, the terminal state, a timestamp, and either the result inline or a URL to fetch it. The customer-side handler validates the signature, deduplicates by job ID and event ID, and processes the result. The standard webhook discipline applies: acknowledge within a few seconds, do the actual work asynchronously, and treat idempotency as the customer's responsibility.
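
The handler steps above can be sketched as follows. The signature scheme (hex-encoded HMAC-SHA256 over the raw body with a shared secret) and the payload keys job_id and event_id are assumptions for the example; the actual scheme depends on the provider. The in-memory set and list stand in for a durable dedup store and a work queue.

```python
import hashlib
import hmac
import json
from typing import List, Set, Tuple

class WebhookReceiver:
    """Customer-side handler: verify, deduplicate, enqueue, acknowledge."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._seen: Set[Tuple[str, str]] = set()
        self.queue: List[dict] = []  # processed later, off the request path

    def handle(self, raw_body: bytes, signature_hex: str) -> int:
        expected = hmac.new(self._secret, raw_body, hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return 401
        event = json.loads(raw_body)
        key = (event["job_id"], event["event_id"])
        if key in self._seen:
            return 200  # duplicate delivery: acknowledge, do nothing
        self._seen.add(key)
        self.queue.append(event)  # acknowledge now, do the work asynchronously
        return 200
```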

The combined pattern is more powerful than either alone. Customers register a webhook for completion, then either wait for the webhook or fall back to polling if the webhook does not arrive within an expected window. The polling acts as a recovery mechanism when the webhook is missed due to network failures, customer-side downtime, or registration errors. The job system treats polling as authoritative: a customer who polls and sees completed can proceed without waiting for the webhook.
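
A minimal sketch of the combined pattern, assuming the webhook handler sets a threading.Event when the completion event arrives and that fetch_state is a hypothetical callable returning the current state string from the status endpoint. In both branches the status endpoint has the final say, matching the rule that polling is authoritative.

```python
import threading
import time
from typing import Callable, Tuple

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def await_completion(
    webhook_received: threading.Event,
    fetch_state: Callable[[], str],
    window_s: float = 60.0,
    poll_interval_s: float = 5.0,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Wait for the webhook, then confirm (or recover) via the status endpoint.

    Returns (terminal_state, source) where source records whether the
    webhook arrived inside the window or polling had to take over.
    """
    notified = webhook_received.wait(timeout=window_s)
    source = "webhook" if notified else "polling"
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state, source
        sleep(poll_interval_s)
    raise TimeoutError("job did not reach a terminal state")
```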

This pattern fits the bulk of B2B SaaS workloads where customers have stable webhook infrastructure and want to be notified rather than poll continuously. The cost is that the customer has to operate a webhook receiver, which not all customers can do (mobile apps, single-page web apps, scripts running on developer laptops).

The server-sent events pattern

For interactive workloads where the customer wants real-time progress updates, server-sent events (SSE) deliver a stream of events over a long-lived HTTP connection. The customer connects to GET /jobs/{id}/events with Accept: text/event-stream, and the server streams progress events as the job runs, ending with a completion event when the job terminates.
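
On the wire, each event in a text/event-stream response is a group of field lines ended by a blank line. The sketch below formats job updates that way; the event names progress and completion and the use of a running counter as the event ID are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, Optional

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def format_sse(event: str, data: Dict, event_id: Optional[int] = None) -> str:
    """Serialize one event in text/event-stream framing."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps without indent yields a single line, so one data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"

def job_event_stream(updates: Iterable[Dict]) -> Iterator[str]:
    """Yield progress events, then one completion event, then stop."""
    for event_id, update in enumerate(updates, start=1):
        if update["state"] in TERMINAL_STATES:
            yield format_sse("completion", update, event_id)
            return  # close the stream once the job terminates
        yield format_sse("progress", update, event_id)
```

A web framework's streaming response would consume job_event_stream and write each string to the connection as it is produced.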

SSE is well-suited to dashboard use cases where a user is watching the job progress in real time. It is poorly suited to backend integrations where the customer has no UI and just wants to know when the job is done. For backend use, polling or webhooks are better fits.

The implementation cost is non-trivial. SSE requires the server to maintain long-lived connections, which interacts poorly with stateless deployment patterns and connection limits. Load balancers may close idle connections. Customers on flaky networks may disconnect frequently. The fallback to polling must be transparent: the customer-side library should retry on disconnect and rely on polling to catch any events missed during disconnection.

For the dashboard use case, SSE pays off because the alternative (polling at high frequency to feel real-time) is much more expensive on both sides. For all other cases, the operational cost of SSE rarely pays back compared to the simpler polling-plus-webhook pattern.

What to do when jobs partially complete

Many long-running operations produce partial results before terminating. A batch PDF generation may produce 80 out of 100 invoices before a failure on number 81. The status API has to decide how to represent this.

The clean answer is to make partial results addressable. The job has a top-level state and a per-item state for each subtask. The status response includes the count of completed items and a way to fetch the completed results. Customers can choose whether to retry the failing items or accept the partial result and move on.
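
One possible shape for such a response is sketched below. The per-item state names, the Item fields, and the summary keys are all hypothetical; the point is only that completed items stay fetchable and the remainder is identifiable for retry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Item:
    item_id: str
    state: str  # assumed per-item states: "pending", "completed", "failed"
    result_url: Optional[str] = None
    error: Optional[str] = None

def summarize(job_id: str, top_state: str, items: List[Item]) -> Dict:
    """Build a status body that exposes partial results."""
    completed = [i for i in items if i.state == "completed"]
    failed = [i for i in items if i.state == "failed"]
    return {
        "id": job_id,
        "state": top_state,
        "items": {"total": len(items), "completed": len(completed), "failed": len(failed)},
        # Finished work remains addressable even though the job did not succeed.
        "completed_results": [i.result_url for i in completed],
        # Everything else is a candidate for a follow-up retry job.
        "incomplete_item_ids": [i.item_id for i in items if i.state != "completed"],
    }
```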

The alternative (all-or-nothing terminal state) is simpler but throws away work. For operations that take meaningful time and resources, partial-result addressability is usually worth the complexity.

The status table schema

The minimum schema is a single jobs table with columns for ID, account, type, state, created_at, started_at, completed_at, result_url, error_message, and a JSONB payload for type-specific data. Index on (account, state) for the "show me my running jobs" query and on (state, created_at) for the worker pickup query.
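
For a concrete picture, the sketch below creates that table in an in-memory SQLite database so it runs anywhere. SQLite has no JSONB type, so the payload is stored as TEXT here; on PostgreSQL the column would be JSONB and the timestamps would be native timestamp columns. The index names and the next_pending helper are assumptions.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE jobs (
    id            TEXT PRIMARY KEY,
    account       TEXT NOT NULL,
    type          TEXT NOT NULL,
    state         TEXT NOT NULL
                  CHECK (state IN ('pending','running','completed','failed','cancelled')),
    created_at    TEXT NOT NULL,
    started_at    TEXT,
    completed_at  TEXT,
    result_url    TEXT,
    error_message TEXT,
    payload       TEXT NOT NULL DEFAULT '{}'
);
-- "show me my running jobs"
CREATE INDEX jobs_account_state ON jobs (account, state);
-- worker pickup: oldest pending job first
CREATE INDEX jobs_state_created ON jobs (state, created_at);
"""

def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def next_pending(conn: sqlite3.Connection) -> Optional[str]:
    """Return the ID of the oldest pending job, or None if the queue is empty."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE state = 'pending' ORDER BY created_at LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

A real worker would also need to claim the row atomically (for example with a conditional UPDATE) so two workers do not pick up the same job; that step is left out of the sketch.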

The result_url is the link the customer fetches to get the actual output. For small results, the URL can be a presigned URL pointing at object storage. For larger results, the URL can point back to a streaming download endpoint. The point of using a separate URL rather than inlining the result is that result data can be much larger than the status response should reasonably be.

The error_message column holds customer-facing error text. The format should be the same as your synchronous-API error responses: structured code, message, and optional remediation hint. Customers who handle synchronous errors with one code path will use the same code path for asynchronous errors.
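
One way to keep the two paths aligned is to build both from the same helper. In the sketch below the field names code, message, and hint, and the example error code, are hypothetical; the serialized form is what would be stored in error_message and returned from the status endpoint.

```python
import json
from typing import Dict, Optional

def error_body(code: str, message: str, hint: Optional[str] = None) -> Dict:
    """Build the error object shared by synchronous and asynchronous paths."""
    err = {"code": code, "message": message}
    if hint:
        err["hint"] = hint  # remediation hint is optional
    return {"error": err}

def to_error_message_column(body: Dict) -> str:
    """Serialize the structured error for storage on the job row."""
    return json.dumps(body, sort_keys=True)
```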

What we do across the four products

Across DocuMint, CronPing, FlagBit, and WebhookVault, the long-running surface is small but growing. DocuMint batch invoice generation can take minutes for hundreds of invoices, and the customer-facing API is currently synchronous with a higher timeout. WebhookVault replay across thousands of events is the most obvious case for status-based asynchronous work, and the design follows the polling-plus-webhook pattern described here.

The pattern that has held up best in customer feedback is to make all long-running operations work with both polling and webhooks. Customers with infrastructure use webhooks. Customers without it poll. The system supports both because the additional cost is small and the customer flexibility is large.

The deeper observation is that the choice of polling vs webhooks vs SSE is mostly a customer-side question, not a server-side question. The server's job is to make a clean status surface that all three patterns can use. The customer chooses the pattern that fits their infrastructure. Designing the status API to work well with all three reduces friction without adding much complexity on the server side.
