Beyond the Basic Queue: Async Job Patterns That Scale With Complexity

Most async work in a typical backend is genuinely simple. A user does something, a job is enqueued, a worker picks it up, the work happens. Email sent. Webhook delivered. Thumbnail generated. The basic queue handles this for years before it stops being enough.

The complexity grows when work has structure. A user signs up, which fires off five jobs: send welcome email, provision tenant, generate API keys, populate sample data, schedule onboarding follow-up. Two of those depend on tenant provisioning finishing first. One of them needs to retry idempotently if it fails. The success email should only fire if all five succeed. Suddenly the simple queue is not enough; you need a way to express dependency, parallelism, and partial failure.

Most teams hit this point and reach for an orchestration framework: Temporal, Airflow, Dagster, AWS Step Functions. These are excellent tools and absolutely the right answer at scale. But there is a useful middle ground where small explicit patterns, implemented with the queue you already have, carry you a long way without the framework's operational weight.

Chained jobs

The simplest dependency relationship: job B depends on job A. The pattern: when job A finishes, it enqueues job B with the data B needs.

The mistake is putting the chain definition in the code that enqueues A. That code does not know what comes after A; A's worker does. The chain belongs at the end of A. This makes A reusable in different contexts (it does not always need to fire B) and keeps the dependency next to the code that creates it.

The corollary: each job's last action before returning success is to enqueue any follow-up jobs. If you want a transaction-like guarantee that B will run if A succeeded, the enqueue must happen in the same database transaction that records A's success. The outbox pattern does this: A writes the next-job intention to a table in that transaction, and a poller later turns those rows into queue messages. Delivery is then at-least-once, which combined with idempotent jobs gives effectively-once execution without distributed transactions.
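A minimal sketch of that outbox idea, assuming SQLite and Python's standard sqlite3 module. The table names (tenants, outbox), the job type "generate_api_keys", and the enqueue callback are hypothetical stand-ins for whatever your queue already exposes.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS tenants (
            id     INTEGER PRIMARY KEY,
            status TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            job_type TEXT NOT NULL,
            payload  TEXT NOT NULL,
            sent     INTEGER NOT NULL DEFAULT 0
        );
    """)


def provision_tenant(conn: sqlite3.Connection, tenant_id: int) -> None:
    """Job A: record success and the intent to run job B in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO tenants (id, status) VALUES (?, 'provisioned')",
            (tenant_id,),
        )
        conn.execute(
            "INSERT INTO outbox (job_type, payload) VALUES (?, ?)",
            ("generate_api_keys", json.dumps({"tenant_id": tenant_id})),
        )


def drain_outbox(conn: sqlite3.Connection, enqueue) -> int:
    """Poller: hand unsent rows to the real queue, then mark them sent."""
    rows = conn.execute(
        "SELECT id, job_type, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, job_type, payload in rows:
        # A crash between enqueue and the UPDATE re-sends this row later,
        # so delivery is at-least-once and job B must tolerate duplicates.
        enqueue(job_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If A's transaction rolls back, the outbox row disappears with it, so B is never enqueued for work that did not happen.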

Fan-out / fan-in

The pattern: one job spawns N parallel jobs, then waits for all of them to complete before triggering a follow-up.

Implementation requires a coordinator: a small piece of state (one row) tracking how many child jobs were spawned and how many have completed. Each child, on completion, increments the completed count. The Nth completion triggers the follow-up.

The interesting bit is the increment. It must be atomic against concurrent child completions, and it must handle child retries (don't double-count). The simplest approach: each child has a unique ID, and the coordinator stores a set of completed-child IDs. The increment is "add my ID to the set if not already present." When the set size equals the spawn count, fire the follow-up.

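In SQL: INSERT INTO fanin_completion (parent_id, child_id) VALUES (?, ?) ON CONFLICT DO NOTHING, followed by SELECT COUNT(*) FROM fanin_completion WHERE parent_id = ?. The select is the gate; if it equals the expected count, the follow-up is due. A child that retries hits the conflict and leaves the count unchanged, so completions are never double-counted. The count alone does not pick a single trigger, though: a child that retries after the set is full, or two children that finish at the same moment, can each see the full count. Either let only one caller claim the trigger (for example with a flag on the coordinator row, set by a guarded UPDATE) or make the follow-up job idempotent.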
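A sketch of the fan-in gate, assuming SQLite 3.24 or newer (for ON CONFLICT DO NOTHING) and Python's sqlite3 module. The fanin_parent table and its fired flag are an addition that the text above does not spell out: a bare count check can report the full count to more than one caller, so here the trigger is claimed with a guarded UPDATE that only one caller can win.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS fanin_parent (
            parent_id TEXT PRIMARY KEY,
            expected  INTEGER NOT NULL,
            fired     INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS fanin_completion (
            parent_id TEXT NOT NULL,
            child_id  TEXT NOT NULL,
            PRIMARY KEY (parent_id, child_id)
        );
    """)


def start_fanout(conn: sqlite3.Connection, parent_id: str, expected: int) -> None:
    """Record how many children were spawned for this parent."""
    with conn:
        conn.execute(
            "INSERT INTO fanin_parent (parent_id, expected) VALUES (?, ?)",
            (parent_id, expected),
        )


def complete_child(conn: sqlite3.Connection, parent_id: str, child_id: str) -> bool:
    """Mark a child done. Returns True for the one caller that should
    enqueue the follow-up, False for everyone else (including retries)."""
    with conn:
        # Set semantics: a retried child hits the primary key and is ignored.
        conn.execute(
            "INSERT INTO fanin_completion (parent_id, child_id) VALUES (?, ?) "
            "ON CONFLICT DO NOTHING",
            (parent_id, child_id),
        )
        # Claim the trigger only if all children are in and nobody fired yet.
        cur = conn.execute(
            """
            UPDATE fanin_parent
               SET fired = 1
             WHERE parent_id = ?
               AND fired = 0
               AND expected = (SELECT COUNT(*) FROM fanin_completion
                                WHERE parent_id = ?)
            """,
            (parent_id, parent_id),
        )
        return cur.rowcount == 1
```

A worker would call complete_child as its last step and enqueue the follow-up only when it returns True. On Postgres the same shape should work, but the isolation level affects what the subquery sees, so treat that as something to verify rather than assume.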

Sagas: long-running work with explicit compensation

Some workflows touch multiple services and need to roll back if a later step fails. Classic example: travel booking. Reserve flight. Reserve hotel. Charge card. If the card fails, cancel the hotel and the flight. If the hotel fails, only cancel the flight.

The naive approach is to use a distributed transaction. The right approach is the saga pattern: each forward step has an explicit compensating action. The orchestrator runs forward steps in order, recording each completion. If a step fails, it runs the compensating action for each completed step in reverse order.

The key property is that compensation is operationally a forward operation, not a rollback. You are not undoing the original work; you are doing new work that semantically reverses the original. "Refund the charge" is a new operation, not a transaction rollback. This is harder to reason about than transactions but works across systems that do not share a transaction context.

For small sagas (3-5 steps), this can be implemented as a chain of jobs with a per-saga state row tracking which steps have completed. For more complex sagas, an orchestration framework starts to earn its keep.
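The forward-then-compensate loop can be sketched in a few lines. This version is deliberately in-memory and synchronous, which is an assumption made for brevity: a real implementation along the lines described above would run each step as a job and persist the completed list in the per-saga state row. The Step and run_saga names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward operation
    compensate: Callable[[], None]  # new operation that semantically reverses it


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order. On the first failure, compensate every step that
    already completed, newest first. Returns True if all steps succeeded."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step itself is not compensated: it never completed.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

With the travel example, a failing card charge after the flight and hotel are reserved would run "cancel hotel" and then "cancel flight". Compensations can fail too, so in practice each one should be retried as an idempotent job of its own.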

Idempotent retries

Every async job will eventually run twice. Either the worker crashed after doing the work but before acknowledging completion, or a network blip caused a duplicate enqueue, or someone manually retried. The job must do the same thing the second time without duplicating side effects.

The pattern: each job has a stable, content-addressed ID derived from its inputs (not the queue's auto-generated ID). The job's first action is to check whether work for this ID has already been done; if so, return success without doing it again.

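The mechanism that works in SQL: a job_executions table with a unique index on (job_type, idempotency_key). The first action is an insert that ignores conflicts (INSERT OR IGNORE in SQLite, INSERT ... ON CONFLICT DO NOTHING in Postgres) with a started_at timestamp. If the insert affects zero rows, this is a duplicate; load the prior result and return. If the insert succeeds, do the work, update the row with the result, return. A row with no result yet means an earlier attempt is still running or crashed mid-work, so decide up front whether that case waits, fails, or takes over after a timeout.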

The detail that catches teams: the idempotency key must be derived from inputs, not generated fresh. If the key is generated when the job is enqueued, a re-enqueue produces a different key and the deduplication fails. The key must be a hash of (operation type + significant inputs) so that the same logical operation always produces the same key.
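A sketch of both halves, the derived key and the claim-then-work flow, assuming SQLite and Python's standard library. The function names, the result column, and the choice of SHA-256 over canonical JSON are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
import sqlite3


def idempotency_key(job_type: str, inputs: dict) -> str:
    """Hash of operation type plus inputs. Canonical JSON (sorted keys)
    ensures the same logical inputs always produce the same key."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{job_type}:{canonical}".encode("utf-8")).hexdigest()


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS job_executions (
            job_type        TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            started_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            result          TEXT,
            UNIQUE (job_type, idempotency_key)
        );
    """)


def run_once(conn: sqlite3.Connection, job_type: str, inputs: dict, work):
    """Run work(inputs) at most once per (job_type, inputs)."""
    key = idempotency_key(job_type, inputs)
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO job_executions (job_type, idempotency_key) "
            "VALUES (?, ?)",
            (job_type, key),
        )
        claimed = cur.rowcount == 1
    if not claimed:
        row = conn.execute(
            "SELECT result FROM job_executions "
            "WHERE job_type = ? AND idempotency_key = ?",
            (job_type, key),
        ).fetchone()
        # NULL result: a prior attempt is in flight or crashed. This sketch
        # just returns None; a real worker needs a policy for that case.
        return json.loads(row[0]) if row[0] is not None else None
    result = work(inputs)
    with conn:
        conn.execute(
            "UPDATE job_executions SET result = ? "
            "WHERE job_type = ? AND idempotency_key = ?",
            (json.dumps(result), job_type, key),
        )
    return result
```

Only the inputs that define the logical operation should go into the key. Including something like an enqueue timestamp would reintroduce the fresh-key problem described above.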

Dead letter handling

Jobs that fail their retry budget should not vanish. They should land in a dead-letter table where a human can inspect them, fix the underlying problem, and explicitly retry or discard them.

The dead-letter table is not a queue; it is a triage list. The columns: original job payload, failure reason, attempt count, original timestamp, last attempt timestamp, status (new, investigated, retrying, abandoned). The interface for it is a UI page that lists dead-letter rows, lets an operator click "retry" or "abandon," and writes the resolution back.
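One possible shape for that table and the two operator actions, assuming SQLite and Python's sqlite3 module. The column names, the extra job_type column, and the bury/resolve helpers are hypothetical; the status values are the ones listed above.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dead_letters (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            job_type        TEXT NOT NULL,
            payload         TEXT NOT NULL,
            failure_reason  TEXT NOT NULL,
            attempt_count   INTEGER NOT NULL,
            enqueued_at     TEXT NOT NULL,
            last_attempt_at TEXT NOT NULL,
            status          TEXT NOT NULL DEFAULT 'new'
                CHECK (status IN ('new', 'investigated', 'retrying', 'abandoned'))
        );
    """)


def bury(conn, job_type, payload, reason, attempts, enqueued_at, last_attempt_at) -> int:
    """Called by a worker once a job has exhausted its retry budget."""
    with conn:
        cur = conn.execute(
            "INSERT INTO dead_letters (job_type, payload, failure_reason, "
            "attempt_count, enqueued_at, last_attempt_at) VALUES (?, ?, ?, ?, ?, ?)",
            (job_type, json.dumps(payload), reason, attempts,
             enqueued_at, last_attempt_at),
        )
        return cur.lastrowid


def resolve(conn, dead_letter_id: int, decision: str, enqueue) -> None:
    """Operator action behind the 'retry' and 'abandon' buttons."""
    if decision not in ("retry", "abandon"):
        raise ValueError(f"unknown decision: {decision}")
    row = conn.execute(
        "SELECT job_type, payload FROM dead_letters WHERE id = ?",
        (dead_letter_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no dead letter with id {dead_letter_id}")
    if decision == "retry":
        enqueue(row[0], json.loads(row[1]))  # hand back to the normal queue
        new_status = "retrying"
    else:
        new_status = "abandoned"
    with conn:
        conn.execute(
            "UPDATE dead_letters SET status = ? WHERE id = ?",
            (new_status, dead_letter_id),
        )
```

The CHECK constraint keeps the status column to the four triage states, so a typo in application code fails loudly instead of creating a fifth state nobody filters for.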

The dead-letter table is the oldest pattern in async work and remains the most under-implemented. Most teams assume retries will eventually succeed; in practice, somewhere between 0.1% and 5% of jobs hit the dead letter, and without a triage path they accumulate as silent data loss.

Observability

Async jobs are easier to lose track of than synchronous requests because no human is waiting for them. The minimum observability:

Queue depth. Alert if it keeps growing over a sustained window rather than draining.
Job age at processing. Time from enqueue to start. P99 of this is your effective async latency.
Failure rate per job type, not only in aggregate. A single job type at 50% failure is invisible if you only track total failures.
Dead-letter rate. Jobs hitting the dead letter per hour. Should be near zero in steady state.
Saga completion rate. If you have multi-step workflows, the rate at which they complete vs hit a compensating path.
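Three of these metrics can be read straight off the queue table. The sketch below assumes a hypothetical jobs table in SQLite with job_type, status ('queued', 'running', 'succeeded', 'failed'), and enqueue/start times stored as epoch seconds; adapt the names to your own schema. SQLite has no built-in percentile function, so the P99 is computed in Python with the nearest-rank method.

```python
import sqlite3
from typing import Dict, Optional


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS jobs (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            job_type    TEXT NOT NULL,
            status      TEXT NOT NULL,
            enqueued_at REAL NOT NULL,
            started_at  REAL
        );
    """)


def queue_depth(conn: sqlite3.Connection) -> int:
    """Jobs waiting to be picked up."""
    return conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status = 'queued'"
    ).fetchone()[0]


def failure_rate_by_type(conn: sqlite3.Connection) -> Dict[str, float]:
    """Fraction of finished jobs that failed, broken out per job type."""
    rows = conn.execute("""
        SELECT job_type,
               AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END)
          FROM jobs
         WHERE status IN ('succeeded', 'failed')
         GROUP BY job_type
    """)
    return dict(rows)


def job_age_p99(conn: sqlite3.Connection) -> Optional[float]:
    """99th percentile of (started_at - enqueued_at), nearest-rank method."""
    ages = sorted(
        row[0] for row in conn.execute(
            "SELECT started_at - enqueued_at FROM jobs "
            "WHERE started_at IS NOT NULL"
        )
    )
    if not ages:
        return None
    rank = (99 * len(ages) + 99) // 100  # integer ceil(0.99 * n)
    return ages[rank - 1]
```

In production these would usually be restricted to a recent time window and exported to whatever metrics system you already run; the queries here scan the whole table for simplicity.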

When to graduate

The patterns above carry you to roughly hundreds of thousands of jobs per day on a single SQLite or Postgres queue. Beyond that, the queue itself starts to be the bottleneck and a real broker (RabbitMQ, Redis Streams, Kafka) earns its place. At thousands of saga-style workflows in flight, an orchestration framework becomes worth its operational complexity.

The transition is rarely abrupt. Most systems run hybrid for a long time: the simple queue handles 95% of jobs, an orchestration framework handles the few workflows that genuinely need the framework's primitives. The patterns above are not replaced; they are joined by a more capable layer for the cases that need it.

We use these patterns on CronPing's monitoring loop (chained jobs for ping evaluation), WebhookVault's replay system (idempotent retries with dead-letter), FlagBit's rollout calculations (fan-out / fan-in across user buckets), and DocuMint's PDF generation pipeline. None of them needs an orchestration framework yet. The day they do, the chain-of-jobs structure will translate cleanly to whichever framework wins, because the dependency structure is already explicit in the code.