The Saga Pattern: How to Get Distributed Transactions Without Two-Phase Commit

Distributed transactions have a textbook answer and a practical answer, and they are different. The textbook answer is two-phase commit (2PC): a transaction coordinator asks each participant to prepare and then, once every participant has voted yes, asks each one to commit. The protocol is correct and well-studied, and it is widely avoided across service boundaries in production. The reason is its failure modes: if the coordinator crashes after participants have prepared but before they learn the outcome, they are left holding locks until it recovers, and that turns into an operational nightmare at any scale where it actually matters.

The practical answer is the saga pattern, formalized by Hector Garcia-Molina and Kenneth Salem in 1987 for a different problem (long-lived transactions in single databases) and adapted in the microservices era for the cross-service case. A saga is a sequence of local transactions, each on a single service; if any step fails, the previously completed steps are reversed by explicit compensating transactions. There is no global lock, no coordinator that can take down the entire system, and no requirement that any participant support the prepare phase.

The shape of a saga

Consider a typical e-commerce checkout: reserve inventory, charge payment, create shipment, send confirmation email. In a single-database world, this is one transaction; if anything fails, the database rolls back. In a multi-service world, each step touches a different service — inventory, payments, shipping, notifications — and there is no shared transaction context.

In the saga formulation, each step is a local transaction that succeeds or fails independently. For each step, we define a compensating transaction that semantically undoes it. Reserve inventory has a release-inventory compensation. Charge payment has a refund-payment compensation. Create shipment has a cancel-shipment compensation. The saga proceeds forward through the steps. If any step fails, the saga runs the compensations for the previously completed steps, in reverse order.
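
The forward-then-compensate flow can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production executor: the Step and run_saga names are hypothetical, the steps only append to a log, and it assumes a failed step leaves nothing behind to compensate (its local transaction rolled back on its own).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # semantic undo for that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step is assumed to have left no state behind,
            # so only the steps that finished are compensated.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipment() -> None:
    raise RuntimeError("shipping service unavailable")


checkout = [
    Step("inventory",
         lambda: log.append("reserve inventory"),
         lambda: log.append("release inventory")),
    Step("payment",
         lambda: log.append("charge payment"),
         lambda: log.append("refund payment")),
    Step("shipment", fail_shipment, lambda: log.append("cancel shipment")),
]

succeeded = run_saga(checkout)
```

Here the shipment step fails, so the log ends with the refund followed by the inventory release, which is the reverse of the order in which the two successful steps ran.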

The critical word is "semantically." A compensating transaction does not literally undo the original; it produces the equivalent business state. A refund is not a rollback of a charge: the charge happened, and the refund is a new transaction that returns the money. The customer sees both. The accounting reflects both. This is fundamental: sagas trade strict atomicity for eventual consistency with explicit compensation.
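
One way to picture the difference between a rollback and a compensation is an append-only ledger. The sketch below is illustrative only; LedgerEntry, charge, refund, and balance are hypothetical names, and amounts are kept as integer cents.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    kind: str          # "charge" or "refund"
    amount_cents: int  # signed: charges positive, refunds negative


ledger: List[LedgerEntry] = []


def charge(amount_cents: int) -> None:
    ledger.append(LedgerEntry("charge", amount_cents))


def refund(amount_cents: int) -> None:
    # The compensation appends a new entry; it never deletes the charge.
    ledger.append(LedgerEntry("refund", -amount_cents))


def balance() -> int:
    return sum(entry.amount_cents for entry in ledger)


charge(4999)
refund(4999)
```

The balance returns to zero, but the ledger holds two entries rather than none, which is the state the customer and the accounting both see.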

Orchestration vs choreography

Sagas come in two flavors. In orchestration, a central coordinator service tracks the saga's progress and tells each step what to do next. In choreography, each service publishes events and listens for events from other services, with no central coordinator — the saga emerges from the event flow.

Orchestration is easier to reason about: there is one place that knows the full saga state, one place to debug, one place to add new steps. The cost is that the orchestrator becomes a critical service that all sagas depend on. Choreography is more resilient: any service can fail without taking down the saga infrastructure, and new steps can be added by adding event listeners. The cost is that no single service knows the full state, which makes debugging harder and makes it easy to accidentally create cyclic event flows that nobody understands.
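
To make the choreography style concrete, here is a small sketch using a synchronous in-memory event bus as a stand-in for a real message broker. The EventBus class, the event names, and the card_ok flag are all assumptions made for the example. Note that no function here knows the whole flow; each handler only knows which event it reacts to and which one it emits.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()


def inventory_on_order_placed(order: Dict) -> None:
    bus.publish("InventoryReserved", order)


def payments_on_inventory_reserved(order: Dict) -> None:
    event = "PaymentCharged" if order["card_ok"] else "PaymentFailed"
    bus.publish(event, order)


def inventory_on_payment_failed(order: Dict) -> None:
    # Compensation is just another reaction to an event.
    bus.publish("InventoryReleased", order)


def shipping_on_payment_charged(order: Dict) -> None:
    bus.publish("ShipmentCreated", order)


bus.subscribe("OrderPlaced", inventory_on_order_placed)
bus.subscribe("InventoryReserved", payments_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)
bus.subscribe("PaymentCharged", shipping_on_payment_charged)

bus.publish("OrderPlaced", {"order_id": "o-1", "card_ok": False})
```

The only place the full sequence is visible is bus.history, which exists here purely for the demo. In a real deployment that view has to be reconstructed from logs or traces, which is the debugging cost described above.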

The pragmatic choice for most teams is orchestration with a lightweight coordinator: a state-machine service or a workflow engine like Temporal that tracks saga state in a database and dispatches the next step. The orchestrator does not need to be highly available in the strict sense — it needs to be durable, so saga state survives a crash, and it needs to be eventually consistent with the participants. These are weaker requirements than 2PC's coordinator.
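
The durability requirement can be illustrated with a very small orchestrator that records its position in a database after every step. This sketch uses Python's built-in sqlite3 module with an in-memory database; the saga table, the STEPS list, and the advance function are hypothetical, and a workflow engine such as Temporal handles this bookkeeping in its own way.

```python
import sqlite3
from typing import Callable, Dict, List

STEPS: List[str] = ["reserve_inventory", "charge_payment", "create_shipment"]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga "
        "(id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    return conn


def advance(conn: sqlite3.Connection, saga_id: str,
            handlers: Dict[str, Callable[[], None]]) -> None:
    """Run the remaining steps, persisting progress after each one."""
    conn.execute(
        "INSERT OR IGNORE INTO saga (id, next_step) VALUES (?, 0)", (saga_id,)
    )
    conn.commit()
    (next_step,) = conn.execute(
        "SELECT next_step FROM saga WHERE id = ?", (saga_id,)
    ).fetchone()
    for index in range(next_step, len(STEPS)):
        handlers[STEPS[index]]()  # may raise; earlier progress is already stored
        conn.execute(
            "UPDATE saga SET next_step = ? WHERE id = ?", (index + 1, saga_id)
        )
        conn.commit()


calls: List[str] = []
crash_once = {"armed": True}


def charge_payment() -> None:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated orchestrator crash")
    calls.append("charge_payment")


handlers = {
    "reserve_inventory": lambda: calls.append("reserve_inventory"),
    "charge_payment": charge_payment,
    "create_shipment": lambda: calls.append("create_shipment"),
}

conn = open_store()
try:
    advance(conn, "saga-1", handlers)
except RuntimeError:
    pass  # the process "crashed" partway through the saga
advance(conn, "saga-1", handlers)  # a restart resumes from the stored position
```

On resume the inventory step is not repeated, because its completion was committed before the crash. A crash between running a step and committing the update would still cause that step to run again, so this design only works if the participants tolerate duplicate invocations.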

The compensation problem

The hard part of saga design is not the forward path; it is the compensation path. Some operations are genuinely reversible (release a hold on inventory). Some are reversible with a different transaction (refund a charge). Some are not reversible at all (an email sent cannot be unsent). The saga design must contend with the irreversibility of the real world.

The standard pattern is to order steps so that irreversible operations come last. Send the email after the payment has cleared and the shipment has been created, not before. This is not always possible, because sometimes the irreversible operation is required for downstream steps (sending a verification email before continuing, for example), but it is the right default. When irreversibility cannot be deferred, the saga must accept that some failure modes leave residue (an email sent for a checkout that was later canceled), and the system design must surface that to humans who can deal with it.

The other compensation hazard is the compensation itself failing. The forward path's failure triggered the compensations; what if a compensation fails? The usual answer is that a compensation cannot be allowed to fail permanently: it is retried until it succeeds, and escalated to a human if it does not. The saga literature makes this explicit by distinguishing between compensable transactions (forward steps with reliable compensation), pivot transactions (the point of no return, typically the irreversible step), and retriable transactions (steps that must succeed, with retries until they do, no compensation needed). Designing the saga as a sequence of these categories makes the failure semantics clear: before the pivot, the saga can roll back; after the pivot, it can only roll forward.
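
The three categories can be encoded directly, which lets the saga definition be checked before it ever runs. The sketch below is one possible encoding, with hypothetical names (Kind, Step, validate, run). It makes two simplifying assumptions: retriable steps are retried a bounded number of times and then surfaced for manual intervention rather than retried forever, and compensations are assumed to succeed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Kind(Enum):
    COMPENSABLE = 1
    PIVOT = 2
    RETRIABLE = 3


@dataclass
class Step:
    name: str
    kind: Kind
    action: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None


def validate(steps: List[Step]) -> None:
    """Require the order: compensable steps, at most one pivot, retriable steps."""
    kinds = [step.kind.value for step in steps]
    if kinds != sorted(kinds):
        raise ValueError("steps must be ordered compensable, pivot, retriable")
    if kinds.count(Kind.PIVOT.value) > 1:
        raise ValueError("a saga has at most one pivot")
    for step in steps:
        if step.kind is Kind.COMPENSABLE and step.compensate is None:
            raise ValueError(f"{step.name} needs a compensation")


def run(steps: List[Step], max_attempts: int = 5) -> str:
    validate(steps)
    compensable_done: List[Step] = []
    for step in steps:
        attempts = max_attempts if step.kind is Kind.RETRIABLE else 1
        for _ in range(attempts):
            try:
                step.action()
                break
            except Exception:
                continue
        else:  # every attempt failed
            if step.kind is Kind.RETRIABLE:
                return "needs_manual_intervention"  # past the pivot: no rollback
            for done in reversed(compensable_done):
                done.compensate()
            return "rolled_back"
        if step.kind is Kind.COMPENSABLE:
            compensable_done.append(step)
    return "completed"


events: List[str] = []
flaky = {"failures_left": 2}


def send_email() -> None:
    if flaky["failures_left"] > 0:
        flaky["failures_left"] -= 1
        raise ConnectionError("mail service timeout")
    events.append("email sent")


saga = [
    Step("reserve", Kind.COMPENSABLE,
         lambda: events.append("reserved"), lambda: events.append("released")),
    Step("charge", Kind.COMPENSABLE,
         lambda: events.append("charged"), lambda: events.append("refunded")),
    Step("ship", Kind.PIVOT, lambda: events.append("shipped")),
    Step("email", Kind.RETRIABLE, send_email),
]
outcome = run(saga)
```

Treating shipment as the pivot is a choice made for this example; where the pivot sits in a real checkout depends on which step the business considers the point of no return.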

Idempotency is non-negotiable

Every saga step, both forward and compensating, must be idempotent. The reason is that the saga coordinator can fail and resume, and retries can happen, so the same step can be invoked twice. If charging a card is not idempotent, the customer is charged twice. If releasing inventory is not idempotent, two releases for one reservation credit the stock twice.

The standard mechanism is an idempotency key carried through the saga and propagated to each step. The participant service records the key when it processes the operation and rejects (or returns the cached result for) duplicate keys. This is the same mechanism described in our earlier piece on idempotency, applied as a hard requirement on every saga participant.
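
A participant that returns the cached result for a duplicate key might look like the following sketch. PaymentService and step_key are hypothetical, and the dictionary stands in for a table in the participant's own database. In a real service, recording the key and applying the side effect would need to happen in the same local transaction.

```python
from typing import Dict


class PaymentService:
    """Hypothetical participant that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        cached = self._results.get(idempotency_key)
        if cached is not None:
            return cached  # duplicate: return the stored result, no side effect
        self.charges_made += 1  # stands in for actually charging the card
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


def step_key(saga_id: str, step_name: str) -> str:
    # One key per saga step, so a retry of the same step reuses the same key.
    return f"{saga_id}:{step_name}"


payments = PaymentService()
key = step_key("saga-1", "charge_payment")
first = payments.charge(key, 4999)
retry = payments.charge(key, 4999)  # e.g. the orchestrator resumed and retried
```

Deriving the key from the saga ID and the step name means the forward step and its compensation get different keys, so a refund is never mistaken for a duplicate of the charge.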

When sagas are wrong

Sagas are not the right tool for every distributed-state problem. They impose meaningful complexity: defining compensations, designing the orchestrator, ensuring idempotency at each participant, handling pivot transactions, dealing with operations that cannot be compensated. The complexity is justified when the participants are genuinely separate services with separate databases and separate teams.

If the participants are services within the same team, sharing a database is often a better answer. The single transaction model is much simpler than a saga, and the cost of database sharing — usually framed as tight coupling — is a real cost but often a smaller one than the saga complexity. The microservices-orthodoxy answer of "every service owns its data" is correct in principle and frequently wrong in practice for small teams.

If the participants are external systems (Stripe, SendGrid, AWS S3) where you cannot control compensation semantics, the saga has to wrap them with adapters that simulate compensation as best they can. Sometimes a natural compensation exists (a Stripe charge can be refunded), sometimes it does not (a SendGrid email is fire-and-forget, and some external services have no compensation API at all), and the saga design has to acknowledge the gap. This is one of the cases where deferring the irreversible step is critical.
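
One way to acknowledge the gap in code is an adapter type where the undo operation is optional, and a missing undo is recorded for a human instead of being silently skipped. Everything in this sketch is hypothetical: ExternalStep, Residue, and the fake provider calls are stand-ins, and no real Stripe or SendGrid API is used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ExternalStep:
    name: str
    call: Callable[[], str]                        # returns a provider reference
    undo: Optional[Callable[[str], None]] = None   # None: no compensation exists


@dataclass
class Residue:
    """Effects that could not be compensated and need human follow-up."""
    items: List[str] = field(default_factory=list)


def compensate(step: ExternalStep, reference: str, residue: Residue) -> None:
    if step.undo is not None:
        step.undo(reference)
    else:
        residue.items.append(f"{step.name}:{reference} could not be undone")


refunded: List[str] = []
payment = ExternalStep("payment", call=lambda: "ch_123", undo=refunded.append)
email = ExternalStep("email", call=lambda: "msg_456")  # fire-and-forget

residue = Residue()
payment_ref = payment.call()
email_ref = email.call()

# Suppose a later step fails: compensate in reverse order of execution.
compensate(email, email_ref, residue)
compensate(payment, payment_ref, residue)
```

The payment is refunded through its adapter, while the email ends up in the residue list, which is where a review queue or an alert would pick it up.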

Our use across products

The four products in this studio (DocuMint, CronPing, FlagBit, and WebhookVault) do not use classical multi-service sagas, because each product is a self-contained monolith with its own SQLite database. The transaction-equivalent patterns we use are local: a single transaction per request, with idempotency keys on the few endpoints (signup, checkout, webhook delivery) where retries are expected. This is the right design for our scale: the saga complexity is not earned until the system actually has multiple services and shared state across them.

Where we do touch external services (Stripe for payments, Listmonk for email, the Plausible analytics endpoint), the pattern is closer to retriable-transaction-with-graceful-degradation: if Listmonk is down, the signup still succeeds and the email subscription is queued for later; if Stripe checkout fails, the user gets a clear error and no half-state is left in our database. The saga discipline shows up as defensive design at the boundary, not as a coordinator.
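
The signup case can be sketched as follows. This is an illustration of the general shape rather than our production code: the table names and functions are invented for the example, and the subscribe argument is a plain callable standing in for whatever client talks to the email service.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE pending_subscriptions (email TEXT PRIMARY KEY)")
    return conn


def signup(conn: sqlite3.Connection, email: str,
           subscribe: Callable[[str], None]) -> None:
    with conn:  # the local transaction: the signup itself always commits
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    try:
        subscribe(email)
    except Exception:
        # Email service is down: queue the subscription instead of failing.
        with conn:
            conn.execute(
                "INSERT OR IGNORE INTO pending_subscriptions (email) VALUES (?)",
                (email,),
            )


def drain_pending(conn: sqlite3.Connection,
                  subscribe: Callable[[str], None]) -> int:
    """Retry queued subscriptions; return how many were delivered."""
    delivered = 0
    rows = conn.execute("SELECT email FROM pending_subscriptions").fetchall()
    for (email,) in rows:
        try:
            subscribe(email)
        except Exception:
            continue  # still down; leave it queued for the next run
        with conn:
            conn.execute(
                "DELETE FROM pending_subscriptions WHERE email = ?", (email,)
            )
        delivered += 1
    return delivered


def service_down(email: str) -> None:
    raise ConnectionError("email service unreachable")


subscribed = []
db = open_db()
signup(db, "ada@example.com", service_down)
pending_before = db.execute(
    "SELECT COUNT(*) FROM pending_subscriptions").fetchone()[0]
delivered = drain_pending(db, subscribed.append)
pending_after = db.execute(
    "SELECT COUNT(*) FROM pending_subscriptions").fetchone()[0]
```

The queued subscription behaves like a retriable step: it has no compensation, it is simply attempted again until it goes through. Because a crash between the subscribe call and the delete would cause a repeat, the subscribe operation should itself be safe to call twice.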

The summary

Two-phase commit is correct in the textbook sense and unworkable at scale, because a coordinator failure in the middle of the protocol leaves participants blocked while holding locks. The saga pattern is the practical alternative for cross-service transactions: a sequence of local transactions with explicit compensations, oriented around eventual consistency rather than strict atomicity, with idempotency at each step as the safety net. Designing a saga is harder than designing a single transaction: the compensation semantics, the pivot point, and the irreversibility of some real-world operations all have to be reasoned about explicitly. The reward is a system that survives partial failures without locking up, that scales to multiple services without global locks or a blocking coordinator, and that produces a state model the operations team can actually debug. Most teams that reach for distributed transactions are in the wrong place, but for the teams that genuinely need them, sagas are the answer that survives contact with production.