The Outbox Pattern: Reliable Event Publishing Without Distributed Transactions

The problem looks trivial when you first encounter it. A user clicks a button. Your service writes a row to the database and publishes an event to a message queue so that other services can react. Both steps need to happen, or neither. The naive code writes the row, then publishes the event, and most of the time it works. But the network is unreliable, processes crash, and the gap between the database commit and the queue publish is exactly the place where the universe likes to drop messages.

The failure takes one of two shapes, depending on the order of operations. If you commit first and publish second, a crash in between loses the message: the row is written, but the event never reaches the queue, and downstream consumers never learn that anything happened. If you publish first and commit second, the event can reach the queue and the database transaction can then roll back, so consumers process an event for state that does not exist. Both outcomes are operationally catastrophic in their own way, and both are inevitable at scale unless you do something about them.

The honest answer to the question "how do I atomically write to the database and publish to the queue" is: you cannot, because they are different systems with different commit boundaries, and the two-phase commit protocol that would solve the problem in theory is operationally toxic in practice (slow, locks resources across services, and creates new failure modes worse than the one it solves). The pattern that has emerged as the right answer is the outbox pattern, and the rest of this piece is its mechanics.

The basic shape

The outbox pattern is one schema change and one worker. The schema change adds a table to your application database, typically called outbox or events, with columns for the event id, event type, payload (usually JSON), created-at timestamp, and a status flag indicating whether the event has been published. The worker is a separate process or thread that polls the outbox table for unpublished events, publishes them to the actual message queue, and marks them as published.
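As a rough illustration of the schema change, here is a minimal sketch using Python's built-in sqlite3 module so that it runs anywhere. The column names, the text-valued status flag, and the index are assumptions made for the example; a PostgreSQL deployment would more likely use uuid, jsonb, and timestamptz column types.

```python
import sqlite3

# Hypothetical column names; the published/unpublished flag is modelled as text.
OUTBOX_SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id          TEXT PRIMARY KEY,                      -- event id
    event_type  TEXT NOT NULL,
    payload     TEXT NOT NULL,                         -- JSON document
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'published'))
);
-- The worker only ever scans for pending rows, so index that access path.
CREATE INDEX IF NOT EXISTS outbox_pending_idx ON outbox (status, created_at);
"""


def create_outbox(conn: sqlite3.Connection) -> None:
    """Create the outbox table and its index if they do not exist yet."""
    conn.executescript(OUTBOX_SCHEMA)


conn = sqlite3.connect(":memory:")
create_outbox(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(outbox)")]
```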

The atomicity property comes from the database transaction. When the user clicks the button, your service writes the business row (the user's order, the new invoice, the flag change) and writes the outbox row in the same transaction. Either both commit or neither commits. There is no window where one is durable and the other is not. The event is now reliably persisted in the same database that holds the business state, and the question of getting it to the queue becomes a separate, retryable, idempotent operation handled by the worker.
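The single-transaction write can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the orders table, the order.placed event type, and the place_order helper are hypothetical names, and sqlite3 stands in for the application database.

```python
import json
import sqlite3
import uuid
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer    TEXT NOT NULL,
    total_cents INTEGER NOT NULL
);
CREATE TABLE outbox (
    id         TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending'
);
""")


def place_order(conn: sqlite3.Connection, customer: str, total_cents: int,
                event_id: Optional[str] = None) -> str:
    """Write the business row and its outbox row in one transaction."""
    event_id = event_id or str(uuid.uuid4())
    with conn:  # commits if the block succeeds, rolls back if it raises
        cur = conn.execute(
            "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
            (customer, total_cents),
        )
        payload = json.dumps({"order_id": cur.lastrowid, "total_cents": total_cents})
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, 'order.placed', ?)",
            (event_id, payload),
        )
    return event_id


def count(conn: sqlite3.Connection, table: str) -> int:
    """Row count helper for the demo (table name is trusted input here)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If the second insert fails for any reason, the first one is rolled back with it, so an order never exists without its event and an event never exists without its order.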

The worker can crash, restart, publish the same event twice, or skip a row on a given pass. None of those failures loses an event, because the outbox row is the source of truth: anything the worker has not yet marked as published is still in the table and is picked up on a later pass. The worst case is a duplicate publish, which the consumer handles via idempotency on the event id (which is why the event id has to be in the outbox row, generated transactionally with everything else).

The polling loop

The worker's polling loop is the place where most outbox implementations succeed or fail in the long run. The naive loop is: SELECT the next batch of unpublished events, publish them one at a time, UPDATE each row to mark it published. This works at low volume and starts to fail in interesting ways as volume grows.

The first failure is throughput. If the worker is single-threaded and publishing one event at a time, throughput is bounded by network latency to the message queue. At a few hundred events per second, this becomes the bottleneck. The fix is batched publishing — many message queues support batch publish APIs, and the worker should use them, with the batch size tuned to the queue's limits and the per-message work the consumer needs to do.
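A batched version of the loop might look like the sketch below. The publish_batch callable is a hypothetical stand-in for whatever batch API your queue client offers, and the batch size of 100 is an arbitrary default, not a recommendation.

```python
import sqlite3
from typing import Callable, List, Tuple

Event = Tuple[str, str, str]  # (id, event_type, payload)


def drain_once(conn: sqlite3.Connection,
               publish_batch: Callable[[List[Event]], None],
               batch_size: int = 100) -> int:
    """Publish one batch of pending events; return how many were published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    # One round trip for the whole batch. If this raises, nothing is marked
    # and the same rows are retried on the next pass.
    publish_batch(rows)
    with conn:  # mark the whole batch in a single transaction
        conn.executemany(
            "UPDATE outbox SET status = 'published' WHERE id = ?",
            [(row[0],) for row in rows],
        )
    return len(rows)


# Demo: an in-memory table with 250 events and a list standing in for the queue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT, payload TEXT, "
    "status TEXT NOT NULL DEFAULT 'pending')"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, event_type, payload) VALUES (?, 'demo', '{}')",
        [(f"evt-{i:04d}",) for i in range(250)],
    )

batches: List[List[Event]] = []
while drain_once(conn, batches.append) > 0:
    pass
batch_sizes = [len(batch) for batch in batches]
```

Note that a crash between publish_batch and the UPDATE republishes the whole batch on restart, which is the duplicate-delivery case the consumer has to tolerate.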

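The second failure is multiple-worker safety. As soon as you scale the worker horizontally for redundancy or throughput, you have to make sure two workers do not claim the same outbox row. The right pattern is the same as the job-queue claim pattern: an atomic UPDATE outbox SET status='claimed', claimed_by=:worker_id WHERE id IN (SELECT id FROM outbox WHERE status='pending' ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED) RETURNING *. The status change is part of the claim: the row locks are released as soon as the claiming transaction commits, so a row that still read as pending would be handed to the next worker as well. A claim left behind by a crashed worker also has to be released after a timeout, or its rows are never published. PostgreSQL's SKIP LOCKED is the load-bearing piece; without it, concurrent workers queue up behind each other's row locks on the same batch instead of moving on to unclaimed rows, which serializes the workers and, when rows are locked in different orders, can deadlock them.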
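The claim step can be exercised locally with the sketch below, with one important caveat: SQLite has no FOR UPDATE SKIP LOCKED, so this stand-in relies on SQLite serializing writers to make the single UPDATE atomic. It shows the shape of the claim (mark rows for one worker in one statement, then read back only what that statement marked), not PostgreSQL's locking behaviour. The 'claimed' status value and the claim_id column are assumptions introduced for the example.

```python
import sqlite3
import uuid
from typing import List, Tuple


def claim_batch(conn: sqlite3.Connection, worker_id: str,
                limit: int = 100) -> List[Tuple[str, str]]:
    """Atomically claim up to `limit` pending rows and return (id, payload) pairs."""
    claim_id = uuid.uuid4().hex  # identifies the rows taken by this call only
    with conn:
        conn.execute(
            """
            UPDATE outbox
               SET status = 'claimed', claimed_by = ?, claim_id = ?
             WHERE id IN (SELECT id FROM outbox
                           WHERE status = 'pending'
                           ORDER BY rowid
                           LIMIT ?)
            """,
            (worker_id, claim_id, limit),
        )
    return conn.execute(
        "SELECT id, payload FROM outbox WHERE claim_id = ? ORDER BY rowid",
        (claim_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
    "status TEXT NOT NULL DEFAULT 'pending', claimed_by TEXT, claim_id TEXT)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, payload) VALUES (?, '{}')",
        [(f"evt-{i}",) for i in range(5)],
    )

first = claim_batch(conn, "worker-a", limit=3)
second = claim_batch(conn, "worker-b", limit=3)
```

Because claimed rows no longer match status = 'pending', the second call only sees what the first one left behind.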

The third failure is unbounded table growth. Outbox tables that retain published events forever become bloated and slow, even with appropriate indexes. The fix is to either delete rows after they have been published and consumed (with a sufficient retention window for replay scenarios) or to move them to a cold archive table. We default to a 30-day retention with a nightly DELETE; longer retention adds cost without value for our use cases.
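A nightly cleanup along those lines could be sketched as below. Deleting in chunks is an assumption added here to keep each transaction short; the 30-day default mirrors the retention mentioned above, and the cutoff is applied to created_at because that is the only timestamp the basic schema carries.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_published(conn: sqlite3.Connection, now: datetime,
                    retention_days: int = 30, chunk_size: int = 1000) -> int:
    """Delete published rows older than the retention window; return the count."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    deleted = 0
    while True:
        with conn:  # one short transaction per chunk
            cur = conn.execute(
                "DELETE FROM outbox WHERE id IN ("
                "  SELECT id FROM outbox"
                "   WHERE status = 'published' AND created_at < ?"
                "   LIMIT ?)",
                (cutoff, chunk_size),
            )
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "created_at TEXT NOT NULL)"  # ISO-8601 UTC strings, which sort chronologically
)
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
with conn:
    conn.executemany("INSERT INTO outbox VALUES (?, ?, ?)", [
        ("old-published", "published", (now - timedelta(days=40)).isoformat()),
        ("old-pending", "pending", (now - timedelta(days=40)).isoformat()),
        ("new-published", "published", (now - timedelta(days=5)).isoformat()),
    ])

removed = purge_published(conn, now)
remaining = sorted(row[0] for row in conn.execute("SELECT id FROM outbox"))
```

Old rows that are still pending are deliberately left alone: an unpublished event is never eligible for cleanup, however old it is.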

The variants

The basic pattern has several variants tuned for different operational profiles.

The first is the change-data-capture variant, where instead of a polling worker reading the outbox table, a CDC tool like Debezium tails the database's write-ahead log and forwards each insert into the outbox table directly to the message queue. This eliminates the polling cost and reduces end-to-end latency, at the cost of operating a separate piece of infrastructure that has to be configured, monitored, and version-pinned to your database. For high-volume use cases the latency win pays back, and CDC is increasingly the default at scale. For small-team operations, polling is simpler and good enough.

The second is the materialized-projection variant, where the outbox is the queue. Consumers do not subscribe to a separate message queue; they tail the outbox table directly, often with PostgreSQL LISTEN/NOTIFY for low-latency wake-ups. The advantage is a simpler architecture (one fewer system to operate). The disadvantage is that consumers are coupled to the producer's database, which makes scaling and isolation harder. The pattern is appropriate for small in-house systems where producer and consumer are operated by the same team and the same database is reachable from both.

The third is the partitioned outbox variant for very high volume, where the outbox table is sharded by some natural key (tenant id, user id) and each shard has its own worker. This avoids hot-spot contention on a single outbox table and scales horizontally with the number of shards. The cost is increased operational complexity and the loss of strict ordering across shards. Most teams should not reach for this; the basic pattern handles thousands of events per second on commodity hardware.
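Routing an event to its shard can be as small as the sketch below. The shard count of 8 and the outbox_NN table naming are hypothetical; the one real constraint is that the hash must be stable across processes, which rules out Python's built-in hash() for strings because it is salted per process.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a natural key (tenant id, user id) to a stable shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread over a small shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def outbox_table_for(key: str, num_shards: int = 8) -> str:
    """Name of the outbox table that events for `key` are written to."""
    return f"outbox_{shard_for(key, num_shards):02d}"


tenants = [f"tenant-{i}" for i in range(1000)]
shards_used = {shard_for(tenant, 8) for tenant in tenants}
```

All events for one key land in the same shard, so ordering is preserved per key even though it is lost across shards.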

The consumer side

The outbox pattern handles producer-side reliability. Consumer-side reliability is a separate problem with a separate solution: idempotency. Every consumer of every event must handle duplicate deliveries correctly, because at-least-once delivery is the default and the outbox worker can publish the same event twice if it crashes after publishing but before marking the row published.

The standard pattern is to track event ids in a consumed-events table on the consumer side, with a unique constraint on event_id. Before processing an event, check the table; if the id is present, skip. After processing, insert the id in the same transaction as any state changes. The check-and-insert can be combined into a single INSERT ... ON CONFLICT DO NOTHING RETURNING id, with a guard that skips processing when the result is empty. The pattern adds a small storage cost (tens of bytes per event once row and index overhead are counted) and gives you exactly-once effects from at-least-once delivery.
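The dedup step can be sketched as follows, again with sqlite3 standing in for the consumer's database. The account_balance table and the handle_once helper are hypothetical, and the sketch checks the cursor's rowcount rather than using RETURNING, which plays the same role of telling you whether the insert actually happened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consumed_events (event_id TEXT PRIMARY KEY);
CREATE TABLE account_balance (account TEXT PRIMARY KEY, cents INTEGER NOT NULL);
INSERT INTO account_balance VALUES ('acct-1', 0);
""")


def handle_once(conn: sqlite3.Connection, event_id: str,
                account: str, amount_cents: int) -> bool:
    """Apply the event's effect at most once; return False for a duplicate."""
    with conn:  # dedup record and state change commit or roll back together
        cur = conn.execute(
            "INSERT INTO consumed_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed: skip the side effect
        conn.execute(
            "UPDATE account_balance SET cents = cents + ? WHERE account = ?",
            (amount_cents, account),
        )
    return True


def balance(conn: sqlite3.Connection, account: str) -> int:
    return conn.execute(
        "SELECT cents FROM account_balance WHERE account = ?", (account,)
    ).fetchone()[0]
```

Recording the id and applying the state change in one transaction is what makes a redelivery harmless: if processing fails, the id is rolled back too and the event can be retried.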

What the pattern does not solve

The outbox pattern is not a silver bullet. It does not give you global ordering across multiple producers. It does not give you exactly-once delivery (only at-least-once with consumer-side dedup). It does not protect against business logic bugs that produce wrong events. It does not eliminate the need for an actual message queue (the outbox is a complement, not a replacement, in most architectures).

What it does solve is the specific atomicity problem of writing to the database and publishing to the queue, and that problem is common enough and painful enough that solving it cleanly is worth the schema and worker investment. The alternative — distributed transactions, ad-hoc retry logic, eventual-consistency guesswork — is operationally worse in every dimension we care about.

Our use across products

The four products in this studio use a smaller variant of the pattern because we do not currently run a separate message queue infrastructure. DocuMint's invoice-generated webhooks, CronPing's monitor-failure alerts, FlagBit's flag-change notifications, and WebhookVault's replay deliveries all use a per-product outbox table read by a per-product background worker that delivers directly to customer-supplied webhook URLs. The pattern is the same — atomic write to outbox, idempotent retry by worker — minus the intermediate message queue. The simplification is appropriate for our scale; if we were operating at higher fan-out or with consumers that were our own services rather than customer endpoints, the message queue would earn its place.

The summary

The outbox pattern is the answer to the deceptively hard problem of atomically writing to the database and publishing an event to a queue. The mechanics are simple — one schema change, one worker, careful idempotency on the consumer side — and the failure modes that are eliminated are the kind that produce data loss and silent corruption in production. The pattern has become the default for reliable event publishing in distributed systems, and the variations exist to tune for specific operational profiles rather than to change the fundamental shape. If you are publishing events from a service that also writes to a database, and the events have to reach consumers reliably, the outbox is the pattern you want.