Saga Pattern
The Problem First
In a monolith, transactions are easy. You wrap multiple operations in a single DB transaction and either everything commits or everything rolls back.
BEGIN;
UPDATE orders SET status = 'confirmed' WHERE id = 9921;
UPDATE inventory SET stock = stock - 2 WHERE product_id = 88;
INSERT INTO payments (order_id, amount) VALUES (9921, 59.99);
COMMIT; -- or ROLLBACK if anything fails
The database handles atomicity. You don't think about it.
In a distributed system, each service owns its own database. There's no shared transaction across them. So if you're placing an order that involves the Order Service, Inventory Service, and Payment Service — you can't just BEGIN across all three.
[Diagram: Order Service (orders DB), Inventory Service (inventory DB), and Payment Service (payments DB), each with its own database and no shared transaction between them.]
If the payment succeeds but the inventory update fails, you now have an inconsistent state: the order is paid for, but stock was never decremented. This is the distributed transaction problem.
Why Not 2-Phase Commit (2PC)?
2PC is the classic solution. A coordinator asks all participants "can you commit?" (prepare phase), and if everyone says yes, it tells them all to commit.
The problem: if the coordinator crashes after participants have voted yes, or after sending some commits but not others, the prepared participants are stuck holding their locks until the coordinator recovers. Even without a crash, 2PC blocks resources: every participant holds locks from the prepare phase until the coordinator says commit or abort. At scale, this hurts both performance and availability.
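To make the blocking behaviour concrete, here is a minimal in-memory sketch of the two phases. The `Participant` class, the `two_phase_commit` function, and the `crash_before_decision` flag are all made up for illustration; a real 2PC implementation involves durable logs and network calls that are not modelled here.

```python
class Participant:
    """Hypothetical resource manager that takes a lock when it prepares."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.locked = False
        self.state = "idle"

    def prepare(self):
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.locked = True  # held until the coordinator sends a decision
        self.state = "prepared"
        return True

    def commit(self):
        self.locked = False
        self.state = "committed"

    def abort(self):
        self.locked = False
        self.state = "aborted"


def two_phase_commit(participants, crash_before_decision=False):
    # Phase 1 (prepare): every participant votes, and yes-voters take locks.
    votes = [p.prepare() for p in participants]

    if crash_before_decision:
        # Simulated coordinator crash: no decision is ever sent,
        # so prepared participants keep their locks.
        return "coordinator crashed"

    # Phase 2 (decision): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

Calling `two_phase_commit(..., crash_before_decision=True)` leaves every participant with `locked == True`, which is the stuck state described above.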
In practice, 2PC is avoided in microservice architectures. The Saga pattern is the alternative.
What a Saga Is
A saga breaks a distributed transaction into a sequence of local transactions, each committed in its own service. If a step fails, there is nothing to roll back, because the earlier steps have already committed. Instead, you run compensating transactions to undo the work already done.
[Diagram: step 1 (Order Service) → 2. Reserve Stock (Inventory Service) → 3. Charge Payment (Payment Service) → 4. Confirm Order (Order Service). If step 3 fails: Compensate: Release Stock → Compensate: Cancel Order.]
Each step either succeeds and the saga moves forward, or it fails and compensation runs backwards through the completed steps.
The key insight: compensating transactions are not rollbacks. A rollback undoes the change at the database level, as if it never happened. A compensation is a new transaction that semantically reverses the effect, and the reversal might not be perfect. If an email was sent, you can't unsend it. The best you can do is send another email saying "sorry, your order was cancelled."
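The forward-then-backward flow can be sketched in a few lines. This is a minimal in-process illustration, not a production saga engine: `run_saga`, the `Step` tuple shape, and the step names in `demo` (including "create order" for the first step) are my own assumptions, and the lambdas stand in for real service calls.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): each callable stands in for a call to a service.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            print(f"step {name!r} failed: {exc}")
            # The failed step itself is not compensated, only the ones before it.
            # A real implementation would also retry compensations that fail.
            for _done_name, done_compensate in reversed(completed):
                done_compensate()
            return False
        completed.append((name, compensate))
    return True


def demo():
    log = []

    def charge():
        raise RuntimeError("card declined")

    steps = [
        ("create order", lambda: log.append("order created"),
         lambda: log.append("order cancelled")),
        ("reserve stock", lambda: log.append("stock reserved"),
         lambda: log.append("stock released")),
        ("charge payment", charge, lambda: log.append("payment refunded")),
    ]
    return run_saga(steps), log
```

In `demo`, the payment step fails, so the log ends with "stock released" followed by "order cancelled". Nothing is erased from the log: the compensations are new entries appended after the original ones.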
Two Ways to Coordinate: Choreography vs Orchestration
There are two fundamental approaches to sequencing a saga. They have very different tradeoffs.
Choreography
No central coordinator. Each service listens for events and decides what to do next. The saga "emerges" from services reacting to each other.
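Compensation flow on failure: the service that fails publishes a failure event, and each service that already completed its step reacts by running its own compensation and publishing the result.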
What I like about it: No single point of failure. Services are loosely coupled: each only knows about the events it consumes and publishes. Adding a new step usually just means a new service subscribing to the right event.
What I don't like: The overall flow is invisible. It's spread across every service. To understand the full saga, you have to read the code of every participant. Debugging a failed saga means hunting across multiple services and logs. And cyclic event dependencies can sneak in without anyone noticing.
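Here is a rough sketch of what choreography looks like in code, using an in-memory event bus as a stand-in for a real message broker. The `EventBus` class and all event names (`OrderCreated`, `StockReserved`, `PaymentFailed`, and so on) are assumptions made for this example.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a message broker (an assumption for this sketch)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every event published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order_id):
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(order_id)


def wire_services(bus, payment_ok=True):
    # Inventory Service: reacts to new orders by reserving stock.
    bus.subscribe("OrderCreated", lambda oid: bus.publish("StockReserved", oid))

    # Payment Service: reacts to reserved stock by charging.
    def charge(oid):
        bus.publish("PaymentCharged" if payment_ok else "PaymentFailed", oid)

    bus.subscribe("StockReserved", charge)

    # Order Service: confirms once payment has gone through.
    bus.subscribe("PaymentCharged", lambda oid: bus.publish("OrderConfirmed", oid))

    # Compensation path: each service reacts to the failure on its own.
    bus.subscribe("PaymentFailed", lambda oid: bus.publish("StockReleased", oid))
    bus.subscribe("StockReleased", lambda oid: bus.publish("OrderCancelled", oid))


def place_order(payment_ok):
    bus = EventBus()
    wire_services(bus, payment_ok)
    bus.publish("OrderCreated", 9921)
    return bus.history
```

Notice that no function describes the whole saga. The only way to see the full flow is to read `bus.history` after the fact, which is the visibility problem in miniature.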
Orchestration
A dedicated orchestrator (usually a separate service or a state machine) drives the saga. It calls each participant explicitly and waits for a response.
The orchestrator knows the full state of the saga at every point. It's the single source of truth for "where are we in this flow."
What I like about it: The entire flow is readable in one place. Easier to debug — you query the orchestrator's state to know exactly where a saga is stuck. Easier to add retries and timeout logic centrally.
What I don't like: The orchestrator becomes a dependency. It can become a bottleneck. There's also a risk of it becoming a god service that starts containing too much business logic. You have to be disciplined about keeping it as a coordinator, not a logic owner.
Side-by-Side Comparison
| | Choreography | Orchestration |
|---|---|---|
| Flow visibility | Implicit, scattered | Explicit, centralized |
| Coupling | Low — services only know events | Medium — services know orchestrator |
| Debugging | Hard — trace across services | Easier — query orchestrator state |
| Adding steps | Easy — new subscriber | Requires orchestrator change |
| Single point of failure | No | Yes (orchestrator) |
| Best for | Simple linear flows | Complex flows with branching/retries |
In practice, I'd lean toward orchestration for anything non-trivial. The visibility and debuggability are worth it.
Saga State Machine
A saga is essentially a state machine. At any point, it's in a defined state and transitions on success or failure.
Persisting this state is important. If the orchestrator crashes mid-saga, it needs to recover and know exactly where it left off. This is usually done by storing saga state in a database, updated after each step.
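A minimal sketch of a persisted saga state machine, using SQLite from the Python standard library. The state names, the `TRANSITIONS` table, and the `SagaStore` class are hypothetical; the point is only that the state is written after each step, so a restarted orchestrator can read it back.

```python
import sqlite3

# (current state, outcome of the next step) -> next state. Names are illustrative.
TRANSITIONS = {
    ("STARTED", "success"): "STOCK_RESERVED",
    ("STARTED", "failure"): "CANCELLED",
    ("STOCK_RESERVED", "success"): "PAYMENT_CHARGED",
    ("STOCK_RESERVED", "failure"): "RELEASING_STOCK",
    ("PAYMENT_CHARGED", "success"): "CONFIRMED",
    ("RELEASING_STOCK", "success"): "CANCELLED",
}


class SagaStore:
    def __init__(self, conn):
        self.conn = conn
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS sagas ("
                "saga_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
            )

    def start(self, saga_id):
        with self.conn:
            self.conn.execute(
                "INSERT INTO sagas (saga_id, state) VALUES (?, 'STARTED')",
                (saga_id,),
            )

    def state(self, saga_id):
        row = self.conn.execute(
            "SELECT state FROM sagas WHERE saga_id = ?", (saga_id,)
        ).fetchone()
        if row is None:
            raise KeyError(saga_id)
        return row[0]

    def advance(self, saga_id, outcome):
        """Apply one transition and persist the new state before returning."""
        current = self.state(saga_id)
        try:
            next_state = TRANSITIONS[(current, outcome)]
        except KeyError:
            raise ValueError(f"no transition from {current} on {outcome}") from None
        with self.conn:
            self.conn.execute(
                "UPDATE sagas SET state = ? WHERE saga_id = ?",
                (next_state, saga_id),
            )
        return next_state
```

After a crash, a new `SagaStore` over the same database sees the last persisted state and can resume from there, for example by retrying the stock release if the saga was left in `RELEASING_STOCK`.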
Compensating Transactions in Detail
This is where it gets genuinely hard. Compensations fall into a few categories:
Cleanly reversible: Reserving stock, holding a seat. The compensation just releases the hold. No side effects.
Semantically reversible: A charge can be refunded. But the refund is a new financial transaction, not an undo. It shows up in bank history.
Not reversible: Sending an email, posting to a social feed, triggering a physical shipment. The best you can do is a follow-up action ("your order was cancelled, sorry").
When designing a saga, I'd map out every step and ask: "what does compensation actually mean here?" If too many steps are irreversible, the saga design probably needs rethinking.
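The "semantically reversible" category is easiest to see with an append-only ledger. This is a toy sketch: `PaymentLedger` and `LedgerEntry` are invented for the example and do not model a real payment provider.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    order_id: int
    kind: str  # "charge" or "refund"
    amount: Decimal


@dataclass
class PaymentLedger:
    entries: list = field(default_factory=list)

    def _total(self, order_id, kind):
        return sum(
            (e.amount for e in self.entries
             if e.order_id == order_id and e.kind == kind),
            Decimal(0),
        )

    def charge(self, order_id, amount):
        self.entries.append(LedgerEntry(order_id, "charge", amount))

    def balance(self, order_id):
        return self._total(order_id, "charge") - self._total(order_id, "refund")

    def refund(self, order_id):
        """Compensation: append an offsetting entry; the charge is never deleted."""
        outstanding = self.balance(order_id)
        if outstanding > 0:
            self.entries.append(LedgerEntry(order_id, "refund", outstanding))
```

After `charge(9921, Decimal("59.99"))` and `refund(9921)`, the balance is zero but the ledger holds two entries. That is the difference from a rollback: the history still shows that the charge happened.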
Handling Failures and Retries
Not every failure means the saga should compensate immediately. Some failures are transient — the payment service might just be temporarily slow.
The orchestrator should distinguish between retryable failures (timeouts, 503s) and terminal failures (invalid card, fraud rejection). Retryable failures should be retried with exponential backoff before triggering compensation.
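A small sketch of that retry policy. The two exception classes and `call_with_retry` are hypothetical names; in a real orchestrator you would map timeouts and 503 responses to the retryable class, and declined cards or fraud rejections to the terminal one. The `sleep` parameter is injectable only so the backoff can be tested without waiting.

```python
import time


class RetryableError(Exception):
    """Transient failure, e.g. a timeout or a 503."""


class TerminalError(Exception):
    """Permanent failure, e.g. an invalid card. Retrying will not help."""


def call_with_retry(step, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call step(); retry transient failures with exponential backoff.

    Raises TerminalError immediately, and RetryableError once attempts run out.
    In both cases the caller should then start compensation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TerminalError:
            raise  # no point retrying; compensate right away
        except RetryableError:
            if attempt == max_attempts:
                raise  # retries exhausted; now it is a saga failure
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Production retry loops usually add random jitter to the delay so that many sagas retrying at once don't hit the recovering service in lockstep. I've left that out to keep the sketch short.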
Where Things Go Wrong
Partial compensation failure: What if the compensation itself fails? If "release stock" fails after "charge payment" already failed, you're stuck. This is why compensation logic needs to be idempotent and retried aggressively. In the worst case, you need a human alert and manual resolution.
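Saga timeout: A step takes too long and you don't know if it succeeded or failed. You need a timeout policy: after N seconds, treat it as failed and compensate. Because the step may in fact have succeeded, or may still succeed later, the compensation has to be safe to run in either case.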
Concurrent sagas on the same data: Two orders trying to reserve the same last item in stock. You need optimistic locking or reservation holds at the individual service level to handle this.
Real-World Example: E-commerce Order
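If payment fails, the orchestrator runs the compensations in reverse order: release the reserved stock, then cancel the order.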
Simple, readable, and the orchestrator has the full picture at every step.
When to Use a Saga
Use it when you have a multi-step business process that spans multiple services and needs to handle failures gracefully. Order flows, booking flows, onboarding flows — anywhere you'd naturally think "this is a transaction" but it crosses service boundaries.
Don't use it for simple two-service interactions. If only two services are involved and the failure handling is trivial, the overhead of a full saga isn't worth it. Sometimes a direct call with a retry is enough.
Also: sagas introduce eventual consistency. The system will be in intermediate states during the saga. Make sure your product can tolerate that — "order pending" is a valid state that users might see.