[backend][architecture][distributed-systems][transactions][microservices]

Saga Pattern
> How distributed transactions work without 2-phase commit — choreography vs orchestration, compensating transactions, and where it breaks down.

The Problem First

In a monolith, transactions are easy. You wrap multiple operations in a single DB transaction and either everything commits or everything rolls back.

BEGIN;
  UPDATE orders SET status = 'confirmed' WHERE id = 9921;
  UPDATE inventory SET stock = stock - 2 WHERE product_id = 88;
  INSERT INTO payments (order_id, amount) VALUES (9921, 59.99);
COMMIT; -- or ROLLBACK if anything fails

The database handles atomicity. You don't think about it.

In a distributed system, each service owns its own database. There's no shared transaction across them. So if you're placing an order that involves the Order Service, Inventory Service, and Payment Service — you can't just BEGIN across all three.

flowchart LR
    OS["Order Service<br/>orders DB"]
    IS["Inventory Service<br/>inventory DB"]
    PS["Payment Service<br/>payments DB"]
    OS -.->|no shared transaction| IS
    IS -.->|no shared transaction| PS

If the payment succeeds but the inventory update fails, the system is left in an inconsistent state: the order is paid but stock was never decremented. This is the distributed transaction problem.


Why Not 2-Phase Commit (2PC)?

2PC is the classic solution. A coordinator asks all participants "can you commit?" (prepare phase), and if everyone says yes, it tells them all to commit.

sequenceDiagram
    participant C as Coordinator
    participant OS as Order Service
    participant IS as Inventory Service
    participant PS as Payment Service
    C->>OS: prepare?
    C->>IS: prepare?
    C->>PS: prepare?
    OS-->>C: ready
    IS-->>C: ready
    PS-->>C: ready
    C->>OS: commit
    C->>IS: commit
    C->>PS: commit

The problem: 2PC is a blocking protocol. If the coordinator crashes after participants have voted to commit, or after sending some commits but not others, the remaining participants are stuck holding their locks until it recovers, because they cannot safely commit or abort on their own. Even without failures, every participant holds its locks from the prepare phase until the coordinator's decision arrives. At scale, this hurts both performance and availability.

In practice, 2PC is usually avoided in microservice architectures. The Saga pattern is the common alternative.


What a Saga Is

A saga breaks a distributed transaction into a sequence of local transactions, each committed in its own service. If a step fails, you don't roll back. Instead, you run compensating transactions to undo the work already done.

flowchart LR
    T1["1. Create Order<br/>Order Service"] --> T2["2. Reserve Stock<br/>Inventory Service"]
    T2 --> T3["3. Charge Payment<br/>Payment Service"]
    T3 --> T4["4. Confirm Order<br/>Order Service"]
    T3 -->|fails| C2["Compensate:<br/>Release Stock"]
    C2 --> C1["Compensate:<br/>Cancel Order"]

Each step either succeeds and the saga moves forward, or it fails and compensation runs backwards through the completed steps.

The key insight: compensating transactions are not rollbacks. A rollback undoes at the database level like it never happened. A compensation is a new transaction that semantically reverses the effect — and it might not be perfect. If an email was sent, you can't unsend it. The best you can do is send another email saying "sorry, your order was cancelled."
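To make the forward/backward flow concrete, here is a minimal in-process sketch. The `Step` and `run_saga` names are made up for illustration, and the lambdas stand in for real service calls. It assumes a failed step leaves nothing behind in its own service, so only the previously completed steps get compensated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the local transaction in one service
    compensate: Callable[[], None]  # semantic undo, run only if a later step fails


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Assumption: the failed step rolled back locally, so only the
            # earlier, already-committed steps need compensating.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def charge_payment() -> None:
    raise RuntimeError("card declined")  # simulate the failing step


saga = [
    Step("create_order",
         lambda: log.append("order created"),
         lambda: log.append("order cancelled")),
    Step("reserve_stock",
         lambda: log.append("stock reserved"),
         lambda: log.append("stock released")),
    Step("charge_payment",
         charge_payment,
         lambda: log.append("payment refunded")),
]

ok = run_saga(saga)
print(ok, log)
```

Note that "payment refunded" never appears in the log: the charge itself failed, so there is nothing to refund. Only the two completed steps are compensated, newest first.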


Two Ways to Coordinate: Choreography vs Orchestration

There are two fundamental approaches to sequencing a saga. They have very different tradeoffs.


Choreography

No central coordinator. Each service listens for events and decides what to do next. The saga "emerges" from services reacting to each other.

sequenceDiagram
    participant OS as Order Service
    participant K as Broker
    participant IS as Inventory Service
    participant PS as Payment Service
    OS->>K: publish "order.created"
    K->>IS: order.created
    IS->>IS: reserve stock
    IS->>K: publish "stock.reserved"
    K->>PS: stock.reserved
    PS->>PS: charge payment
    PS->>K: publish "payment.completed"
    K->>OS: payment.completed
    OS->>OS: confirm order

Compensation flow on failure:

sequenceDiagram
    participant OS as Order Service
    participant K as Broker
    participant IS as Inventory Service
    participant PS as Payment Service
    OS->>K: publish "order.created"
    K->>IS: order.created
    IS->>IS: reserve stock
    IS->>K: publish "stock.reserved"
    K->>PS: stock.reserved
    PS->>PS: charge payment (fails)
    PS->>K: publish "payment.failed"
    K->>IS: payment.failed
    IS->>IS: release reserved stock
    IS->>K: publish "stock.released"
    K->>OS: stock.released
    OS->>OS: cancel order

What I like about it: No central coordinator to become a single point of failure. Services are loosely coupled: each only knows about the events it consumes and publishes. Adding a new step just means a new service subscribes to the right event.

What I don't like: The overall flow is invisible. It's spread across every service. To understand the full saga, you have to read the code of every participant. Debugging a failed saga means hunting across multiple services and logs. And cyclic event dependencies can sneak in without anyone noticing.
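The two event flows above can be sketched with a toy in-memory broker. Everything here is hypothetical: the `Broker` class stands in for a real message broker, delivery is synchronous, and the "amounts over 100 are declined" rule exists only to trigger the failure path.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Broker:
    """Hypothetical in-memory stand-in for a real message broker."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # topics in publish order, for inspection

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.history.append(topic)
        for handler in list(self.subscribers[topic]):
            handler(event)  # synchronous delivery keeps the sketch simple


broker = Broker()
orders: Dict[int, str] = {}  # Order Service's local state


def inventory_on_order_created(event: dict) -> None:
    broker.publish("stock.reserved", event)  # reserve stock, then announce it


def payment_on_stock_reserved(event: dict) -> None:
    # Assumed rule for the demo: charges above 100 are declined.
    topic = "payment.failed" if event["amount"] > 100 else "payment.completed"
    broker.publish(topic, event)


def inventory_on_payment_failed(event: dict) -> None:
    broker.publish("stock.released", event)  # compensation: release the hold


def order_on_payment_completed(event: dict) -> None:
    orders[event["order_id"]] = "confirmed"


def order_on_stock_released(event: dict) -> None:
    orders[event["order_id"]] = "cancelled"  # compensation: cancel the order


broker.subscribe("order.created", inventory_on_order_created)
broker.subscribe("stock.reserved", payment_on_stock_reserved)
broker.subscribe("payment.failed", inventory_on_payment_failed)
broker.subscribe("payment.completed", order_on_payment_completed)
broker.subscribe("stock.released", order_on_stock_released)


def place_order(order_id: int, amount: float) -> None:
    orders[order_id] = "pending"
    broker.publish("order.created", {"order_id": order_id, "amount": amount})


place_order(1, 59.99)  # happy path
place_order(2, 500.0)  # payment fails, compensation events flow back
print(orders)
```

No function in this sketch describes the saga as a whole. The flow only exists as the sum of the subscriptions, which is exactly the visibility problem described above.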


Orchestration

A dedicated orchestrator (usually a separate service or a state machine) drives the saga. It calls each participant explicitly and waits for a response.

sequenceDiagram
    participant O as Saga Orchestrator
    participant OS as Order Service
    participant IS as Inventory Service
    participant PS as Payment Service
    O->>OS: createOrder
    OS-->>O: orderCreated
    O->>IS: reserveStock
    IS-->>O: stockReserved
    O->>PS: chargePayment
    PS-->>O: paymentFailed ❌
    O->>IS: releaseStock
    IS-->>O: stockReleased
    O->>OS: cancelOrder
    OS-->>O: orderCancelled

The orchestrator knows the full state of the saga at every point. It's the single source of truth for "where are we in this flow."

What I like about it: The entire flow is readable in one place. Easier to debug — you query the orchestrator's state to know exactly where a saga is stuck. Easier to add retries and timeout logic centrally.

What I don't like: The orchestrator becomes a dependency. It can become a bottleneck. There's also a risk of it becoming a god service that starts containing too much business logic. You have to be disciplined about keeping it as a coordinator, not a logic owner.


Side-by-Side Comparison

flowchart TD
    subgraph Choreography
        S1[Service A] -->|event| S2[Service B]
        S2 -->|event| S3[Service C]
        S3 -->|event| S1
    end
    subgraph Orchestration
        O[Orchestrator] -->|command| A[Service A]
        O -->|command| B[Service B]
        O -->|command| C[Service C]
    end

| | Choreography | Orchestration |
|---|---|---|
| Flow visibility | Implicit, scattered | Explicit, centralized |
| Coupling | Low: services only know events | Medium: services know orchestrator |
| Debugging | Hard: trace across services | Easier: query orchestrator state |
| Adding steps | Easy: new subscriber | Requires orchestrator change |
| Single point of failure | No | Yes (orchestrator) |
| Best for | Simple linear flows | Complex flows with branching/retries |

In practice, I'd lean toward orchestration for anything non-trivial. The visibility and debuggability are worth it.


Saga State Machine

A saga is essentially a state machine. At any point, it's in a defined state and transitions on success or failure.

stateDiagram-v2
    [*] --> OrderCreated
    OrderCreated --> StockReserved : reserve stock ✓
    OrderCreated --> OrderCancelled : reserve stock ✗
    StockReserved --> PaymentCharged : charge payment ✓
    StockReserved --> StockReleased : charge payment ✗
    StockReleased --> OrderCancelled : compensate
    PaymentCharged --> OrderConfirmed : confirm order ✓
    PaymentCharged --> PaymentRefunded : confirm order ✗
    PaymentRefunded --> StockReleased2 : release stock
    StockReleased2 --> OrderCancelled : cancel order
    OrderConfirmed --> [*]
    OrderCancelled --> [*]

Persisting this state is important. If the orchestrator crashes mid-saga, it needs to recover and know exactly where it left off. This is usually done by storing saga state in a database, updated after each step.
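One way to sketch this is a transition table plus a tiny persistent store. The `SagaStore` class, the `saga` table, and the `"ok"`/`"fail"` outcome labels are all assumptions for illustration, and SQLite stands in for whatever database the orchestrator really uses. For brevity the table folds the diagram's `StockReleased2` node into `StockReleased`, since both lead to `OrderCancelled`.

```python
import sqlite3

# (current state, outcome of the next step) -> next state
TRANSITIONS = {
    ("OrderCreated", "ok"): "StockReserved",
    ("OrderCreated", "fail"): "OrderCancelled",
    ("StockReserved", "ok"): "PaymentCharged",
    ("StockReserved", "fail"): "StockReleased",
    ("StockReleased", "ok"): "OrderCancelled",
    ("PaymentCharged", "ok"): "OrderConfirmed",
    ("PaymentCharged", "fail"): "PaymentRefunded",
    ("PaymentRefunded", "ok"): "StockReleased",
}
TERMINAL = {"OrderConfirmed", "OrderCancelled"}


class SagaStore:
    """Hypothetical store: one row per saga, updated after every step."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS saga "
            "(id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def start(self, saga_id: str) -> None:
        with self.conn:  # commits on success, rolls back on exception
            self.conn.execute(
                "INSERT INTO saga (id, state) VALUES (?, ?)",
                (saga_id, "OrderCreated"),
            )

    def state(self, saga_id: str) -> str:
        row = self.conn.execute(
            "SELECT state FROM saga WHERE id = ?", (saga_id,)
        ).fetchone()
        if row is None:
            raise KeyError(saga_id)
        return row[0]

    def advance(self, saga_id: str, outcome: str) -> str:
        current = self.state(saga_id)
        nxt = TRANSITIONS[(current, outcome)]  # KeyError on an illegal transition
        with self.conn:
            self.conn.execute(
                "UPDATE saga SET state = ? WHERE id = ?", (nxt, saga_id)
            )
        return nxt


conn = sqlite3.connect(":memory:")
store = SagaStore(conn)
store.start("saga-1")
store.advance("saga-1", "ok")    # stock reserved
store.advance("saga-1", "fail")  # payment failed, stock released

# Simulated orchestrator restart: a fresh object over the same database.
recovered = SagaStore(conn)
resumed_from = recovered.state("saga-1")
final = recovered.advance("saga-1", "ok")
print(resumed_from, final, final in TERMINAL)
```

The recovered orchestrator does not need anything from memory. The persisted row tells it the saga stopped in `StockReleased`, so the only remaining step is cancelling the order.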


Compensating Transactions in Detail

This is where it gets genuinely hard. Compensations fall into a few categories:

Cleanly reversible: Reserving stock, holding a seat. The compensation just releases the hold. No side effects.

Semantically reversible: A charge can be refunded. But the refund is a new financial transaction, not an undo. It shows up in bank history.

Not reversible: Sending an email, posting to a social feed, triggering a physical shipment. The best you can do is a follow-up action ("your order was cancelled, sorry").

flowchart LR
    subgraph Reversible
        R1[Reserve Inventory] -->|compensate| R2[Release Inventory]
    end
    subgraph Semantic
        S1[Charge Payment] -->|compensate| S2[Issue Refund]
    end
    subgraph Irreversible
        I1[Send Email] -->|compensate| I2[Send Cancellation Email]
    end

When designing a saga, I'd map out every step and ask: "what does compensation actually mean here?" If too many steps are irreversible, the saga design probably needs rethinking.
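For the cleanly reversible case, here is a rough sketch of a reservation whose compensation is safe to call more than once. That matters because compensations may be retried. The `Inventory` class and its in-memory dictionaries are hypothetical; a real service would keep reservations in its own database.

```python
from typing import Dict, Tuple


class Inventory:
    """Hypothetical inventory service that tracks holds per order."""

    def __init__(self, stock: Dict[int, int]) -> None:
        self.stock = dict(stock)
        # order_id -> (product_id, quantity) for every active hold
        self.reservations: Dict[str, Tuple[int, int]] = {}

    def reserve(self, order_id: str, product_id: int, qty: int) -> None:
        if order_id in self.reservations:
            return  # duplicate request: the hold already exists
        if self.stock.get(product_id, 0) < qty:
            raise ValueError("insufficient stock")
        self.stock[product_id] -= qty
        self.reservations[order_id] = (product_id, qty)

    def release(self, order_id: str) -> None:
        """Compensation. A no-op if the hold is gone or never existed."""
        reservation = self.reservations.pop(order_id, None)
        if reservation is None:
            return
        product_id, qty = reservation
        self.stock[product_id] += qty


inv = Inventory({88: 5})
inv.reserve("order-9921", 88, 2)
after_reserve = inv.stock[88]

inv.release("order-9921")
inv.release("order-9921")  # retried compensation: no double release
inv.release("order-0000")  # hold never existed: still safe
after_release = inv.stock[88]
print(after_reserve, after_release)
```

Keying the hold by order ID is what makes both operations repeatable: the service can tell "already done" apart from "not done yet" without the caller tracking it.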


Handling Failures and Retries

Not every failure means the saga should compensate immediately. Some failures are transient — the payment service might just be temporarily slow.

flowchart TD
    A[Call Payment Service] --> B{Success?}
    B -->|yes| C[Continue saga]
    B -->|no - transient| D[Retry with backoff]
    D --> A
    B -->|no - permanent| E[Start compensation]
    D -->|max retries exceeded| E

The orchestrator should distinguish between retryable failures (timeouts, 503s) and terminal failures (invalid card, fraud rejection). Retryable failures should be retried with exponential backoff before triggering compensation.
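A minimal sketch of that distinction, assuming the caller has already mapped raw errors onto two hypothetical exception types, `TransientError` and `PermanentError`. The `call_with_retry` helper and its default values are illustrative, not taken from any library.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Retryable: timeouts, 503s."""


class PermanentError(Exception):
    """Terminal: invalid card, fraud rejection."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with exponential backoff.

    PermanentError is not caught, so it propagates on the first attempt.
    When retries run out, the last TransientError is re-raised. In both
    cases the caller is expected to start compensation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


attempts = {"n": 0}


def flaky_charge() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("503 from payment service")
    return "charged"


delays: List[float] = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_charge, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In production you would likely also add random jitter to the delay so that many sagas do not retry in lockstep.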


Where Things Go Wrong

Partial compensation failure: What if the compensation itself fails? If "release stock" fails after "charge payment" already failed, you're stuck. This is why compensation logic needs to be idempotent and retried aggressively. In the worst case, you need a human alert and manual resolution.

Saga timeout: A step takes too long and you don't know if it succeeded or failed. You need a timeout policy: after N seconds, treat it as failed and compensate. Because the step may in fact have succeeded, the compensation has to be safe to run in either case.

Concurrent sagas on the same data: Two orders trying to reserve the same last item in stock. You need optimistic locking or reservation holds at the individual service level to handle this.
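The optimistic-locking option for the last case can be sketched as a conditional update inside the Inventory Service. The `version` column and the `try_reserve` helper are assumptions for illustration (the `inventory` table from the SQL at the top has no such column), and SQLite stands in for the service's real database.

```python
import sqlite3
from typing import Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "product_id INTEGER PRIMARY KEY, "
    "stock INTEGER NOT NULL, "
    "version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES (88, 1, 0)")  # one item left
conn.commit()


def read_stock(conn: sqlite3.Connection, product_id: int) -> Tuple[int, int]:
    """Return (stock, version) as seen right now."""
    return conn.execute(
        "SELECT stock, version FROM inventory WHERE product_id = ?",
        (product_id,),
    ).fetchone()


def try_reserve(
    conn: sqlite3.Connection, product_id: int, qty: int, expected_version: int
) -> bool:
    """Decrement stock only if nobody changed the row since we read it."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET stock = stock - ?, version = version + 1 "
            "WHERE product_id = ? AND version = ? AND stock >= ?",
            (qty, product_id, expected_version, qty),
        )
    return cur.rowcount == 1  # zero rows updated means we lost the race


# Two sagas read the same snapshot of the last item.
_, version_a = read_stock(conn, 88)
_, version_b = read_stock(conn, 88)

first = try_reserve(conn, 88, 1, version_a)   # wins
second = try_reserve(conn, 88, 1, version_b)  # stale version, rejected
print(first, second, read_stock(conn, 88))
```

The losing saga sees a plain failure from its "reserve stock" step and can either re-read and retry or start compensation, so the conflict is resolved inside one service without any cross-service lock.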


Real-World Example: E-commerce Order

sequenceDiagram
    participant O as Orchestrator
    participant OS as Order Service
    participant IS as Inventory Service
    participant PS as Payment Service
    participant FS as Fulfillment Service
    participant NS as Notification Service
    O->>OS: createOrder → pending
    OS-->>O: ok
    O->>IS: reserveStock(orderId, items)
    IS-->>O: ok, reserved for 15 min
    O->>PS: chargePayment(orderId, amount)
    PS-->>O: ok, charged
    O->>FS: scheduleFulfillment(orderId)
    FS-->>O: ok, scheduled
    O->>OS: updateOrder → confirmed
    OS-->>O: ok
    O->>NS: sendConfirmationEmail(userId)
    NS-->>O: ok

If payment fails:

sequenceDiagram
    participant O as Orchestrator
    participant IS as Inventory Service
    participant PS as Payment Service
    participant OS as Order Service
    O->>PS: chargePayment → fails (declined)
    O->>IS: releaseStockReservation
    IS-->>O: ok
    O->>OS: updateOrder → cancelled
    OS-->>O: ok

Simple, readable, and the orchestrator has the full picture at every step.


When to Use a Saga

Use it when you have a multi-step business process that spans multiple services and needs to handle failures gracefully. Order flows, booking flows, onboarding flows — anywhere you'd naturally think "this is a transaction" but it crosses service boundaries.

Don't use it for simple two-service interactions. If only two services are involved and the failure handling is trivial, the overhead of a full saga isn't worth it. Sometimes a direct call with a retry is enough.

Also: sagas introduce eventual consistency. The system will be in intermediate states during the saga. Make sure your product can tolerate that — "order pending" is a valid state that users might see.