Event-Driven Architecture
The Problem First
Before getting into how EDA works, I want to understand why it exists. Because the natural instinct when Service A needs to tell Service B something happened is just... call it.
This works. But notice what's happening: the Order Service now has to know about three other services (say Inventory, Notification, and Analytics). It has to call them in sequence. If Notification Service is slow, the whole thing is slow. If Analytics is down, the order flow might fail. And every time you add a new service that cares about orders, you have to go modify the Order Service again.
The Order Service has become a coordinator. That's the coupling problem — and it compounds fast.
The Core Shift
Instead of Service A calling Service B, Service A records that something happened. It publishes an event to a shared channel. Any service that cares subscribes to that channel and reacts independently.
The Order Service no longer knows who's listening. It doesn't care. It just says "order placed" and its job is done. The broker handles delivery, buffering, and fan-out.
This is the fundamental inversion: instead of the producer orchestrating consumers, consumers are responsible for reacting to things they care about.
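The inversion can be sketched with a toy in-memory broker. This is purely an illustration, not a real broker; the `Broker` class and its method names are made up for the example.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory pub/sub channel: producers publish, subscribers react."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer doesn't know or care who is listening.
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()
received = []
broker.subscribe("order.placed", lambda e: received.append(("inventory", e["orderId"])))
broker.subscribe("order.placed", lambda e: received.append(("notification", e["orderId"])))

# The Order Service just records the fact; fan-out is the broker's job.
broker.publish("order.placed", {"orderId": "ord_9921"})
print(received)  # both subscribers got the event independently
```

Adding a third subscriber here requires zero changes to the publishing side, which is the whole point.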
What an Event Actually Is
An event is an immutable record of something that happened. Not a command ("do this"), not a query ("give me this") — a fact ("this happened"). You don't change events after publishing them.
```json
{
  "eventType": "order.placed",
  "eventId": "evt_01HZ9K2M3X",
  "timestamp": "2026-03-10T14:23:00Z",
  "version": "1.0",
  "source": "order-service",
  "data": {
    "orderId": "ord_9921",
    "userId": "usr_441",
    "items": [
      { "productId": "prod_88", "quantity": 2, "price": 29.99 }
    ],
    "totalAmount": 59.99
  }
}
```
A few things worth noting here:
- `eventId` is a unique ID for deduplication — if a consumer receives the same event twice, it can detect and ignore the duplicate
- `version` matters a lot — events are stored permanently in systems like Kafka, so if you change the shape of the event later, consumers reading old events still need to handle them
- `source` tells you which service produced it, useful for debugging
- The `data` payload should be self-contained enough that consumers don't need to call back to get more info
How the Broker Works Under the Hood
The broker is the backbone. It's what makes the whole thing work. Understanding its internals is important.
Topics and Partitions (Kafka model)
In Kafka, events are written to topics. A topic is just a named log — an append-only, ordered sequence of events. Topics are split into partitions for parallelism.
[Diagram: a producer writes to a topic's partitions (Partition 0, Partition 1, Partition 2); each partition is an independent append-only log with offsets 0, 1, 2, 3...]
When a producer writes an event, Kafka picks which partition to write it to — usually based on a partition key (e.g., orderId). Events with the same key always go to the same partition. This is how Kafka guarantees ordering: events in the same partition are strictly ordered. Across partitions, there's no ordering guarantee.
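Key-based partitioning boils down to hashing the key modulo the partition count. A minimal sketch (Kafka's Java client defaults to a murmur2 hash; `md5` below is just a stand-in to keep this stdlib-only):

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key, modulo the partition count. Same key, same
    # partition, every time, which is what preserves per-key ordering.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event about order ord_9921 lands in the same partition,
# so "order placed" and "order updated" for it stay in order.
assert partition_for("ord_9921") == partition_for("ord_9921")
```

Note the corollary: changing the partition count reshuffles which keys map where, which is why partition counts are painful to change later.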
Each event in a partition has an offset — just an incrementing integer. Consumers track which offset they've read up to. This is how Kafka knows what to deliver next.
How Consumers Read
Kafka uses a pull model. Consumers poll the broker asking "give me events from partition X starting at offset Y." The broker returns a batch. The consumer processes it, commits the offset, and polls again.
The offset commit is what makes at-least-once delivery work. If the consumer crashes after processing but before committing, it'll re-read and re-process the same events on restart. That's why idempotency in consumers matters.
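The crash-before-commit scenario can be simulated in a few lines. This is a toy model of one partition's log and a consumer's committed offset, not real Kafka client code:

```python
# Simulate a consumer that crashes after processing but before committing.
log = ["evt_a", "evt_b", "evt_c"]   # one partition's append-only log
committed_offset = 0                # durable; survives the "crash"
processed = []

def poll(offset, max_batch=2):
    return log[offset:offset + max_batch]

# First run: process a batch, then "crash" before committing the offset.
for event in poll(committed_offset):
    processed.append(event)
# crash here: committed_offset is still 0

# Restart: we re-read from the last committed offset, so evt_a and evt_b
# get processed a second time. That's at-least-once delivery.
for event in poll(committed_offset):
    processed.append(event)
committed_offset += 2

print(processed)  # ['evt_a', 'evt_b', 'evt_a', 'evt_b'] (duplicates!)
```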
Consumer Groups
Multiple consumers can read from the same topic by forming a consumer group. Kafka assigns each partition to exactly one consumer in the group. This is how you scale consumers horizontally.
Each partition is consumed by exactly one consumer in the group at a time — so no duplicate processing within a group. But two different consumer groups (e.g., inventory-service and notification-service) both get every event independently.
If you have 3 partitions and 4 consumers in a group, one consumer sits idle. If you have 3 partitions and 2 consumers, one consumer handles 2 partitions. The number of partitions is the ceiling on parallelism.
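The partition-to-consumer math above can be made concrete with a simple round-robin assignment. Real Kafka uses configurable assignors (range, sticky, etc.), so treat this as a simplification of the idea, not the actual algorithm:

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: one consumer handles two partitions.
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# 3 partitions, 4 consumers: c4 gets nothing and sits idle.
# The partition count is the ceiling on parallelism.
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```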
Replication
Kafka replicates each partition across multiple brokers for fault tolerance. One broker is the leader for a partition — it handles all reads and writes. Others are followers that replicate the leader's log.
If the leader goes down, one follower is elected as the new leader. Events that were written to the leader but not yet replicated may be lost — this is the tradeoff between durability and throughput. Kafka's acks setting controls this: acks=all means the leader waits for all followers to confirm before acknowledging the write. Safer, but slower.
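The durability tradeoff can be modeled directly: an acknowledged write is only safe if some surviving follower has it. A toy calculation, not real broker code, where followers have replicated only the first `replicated_count` events:

```python
def lost_acked_events(log_on_leader, replicated_count, acks):
    """Which acknowledged events are lost if the leader dies right now?"""
    if acks == "all":
        # Leader only acknowledged writes after every follower confirmed,
        # so nothing acknowledged can be lost.
        acknowledged = log_on_leader[:replicated_count]
    else:  # acks=1: leader acknowledged immediately, before replication caught up
        acknowledged = log_on_leader[:]
    survivors = log_on_leader[:replicated_count]  # what a new leader can recover
    return [e for e in acknowledged if e not in survivors]

print(lost_acked_events(["e1", "e2", "e3"], replicated_count=2, acks="1"))    # ['e3'] was acked but is gone
print(lost_acked_events(["e1", "e2", "e3"], replicated_count=2, acks="all"))  # [] nothing acked is lost
```

The cost of `acks=all` is latency on every write; the cost of `acks=1` is the occasional `e3`.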
Full Request Flow
Here's what actually happens end-to-end when a user places an order in an event-driven system: the Order Service validates the request, saves the order to its own database, publishes an `order.placed` event to the broker, and returns a response to the user. The consumers (inventory, notification, analytics) each pick up the event from the broker and do their work on their own schedule.
The response to the user happens before the consumers do anything. The order is confirmed as soon as it's saved and the event is published. Everything else is eventual.
Delivery Guarantees in Detail
This is one of the trickier parts to get right.
| Guarantee | How It Works | Risk |
|---|---|---|
| At-most-once | Commit offset before processing | Can lose messages if consumer crashes mid-process |
| At-least-once | Commit offset after processing | Can process the same message twice if crash after process but before commit |
| Exactly-once | Transactional producer + consumer | Complex, expensive, not always available |
Most real systems use at-least-once and build idempotent consumers. The logic is: it's better to process a duplicate than to lose an event. So you design your consumers to be safe to run twice — usually by checking "have I already processed this eventId?" before doing any work.
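The "have I already processed this eventId?" check looks roughly like this. A sketch only: in production the seen-ID store would be durable (a database table, Redis, etc.), not an in-memory set, since the guard has to survive restarts:

```python
processed_ids = set()  # stand-in for a durable store of seen eventIds

def handle_order_placed(event):
    # Idempotency guard: a redelivered event is detected and skipped.
    if event["eventId"] in processed_ids:
        return "skipped"
    processed_ids.add(event["eventId"])
    # ... do the real work (reserve inventory, send email, ...) ...
    return "processed"

evt = {"eventType": "order.placed", "eventId": "evt_01HZ9K2M3X"}
print(handle_order_placed(evt))  # processed
print(handle_order_placed(evt))  # skipped: the at-least-once redelivery is harmless
```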
Push vs Pull
Push: The broker pushes events to consumers as they arrive. Lower latency, but if the consumer is slow, the broker can overwhelm it. You need flow control mechanisms.
Pull: The consumer asks for events when it's ready. Natural backpressure — a slow consumer just falls behind in its offset without crashing. The broker doesn't care. This is why Kafka uses pull.
Event Schema and Versioning
This is something I didn't think about at first, but it's actually one of the harder operational problems.
Since events are stored in Kafka for potentially months or years, and old consumers still need to read old events, you can't just change the shape of an event freely.
Safe changes: adding optional fields, adding new event types. Dangerous changes: renaming fields, removing fields, changing field types.
The standard approach is to use a schema registry (like Confluent Schema Registry for Kafka) that validates every event against a registered schema before it's published. Consumers fetch the schema by ID embedded in the event. This way, schema changes are versioned and validated.
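What "adding optional fields is safe" means in practice: consumers read new fields with a default, so both old and new event versions parse. A sketch where the `currency` field and its `"USD"` default are hypothetical, invented for the example:

```python
def handle(event):
    data = event["data"]
    # Suppose v1.1 added an optional 'currency' field; old v1.0 events
    # don't have it. Reading with a default keeps one consumer compatible
    # with both versions sitting in the same topic.
    currency = data.get("currency", "USD")
    return (data["orderId"], currency)

v1_0 = {"version": "1.0", "data": {"orderId": "ord_1"}}
v1_1 = {"version": "1.1", "data": {"orderId": "ord_2", "currency": "EUR"}}
print(handle(v1_0))  # ('ord_1', 'USD')
print(handle(v1_1))  # ('ord_2', 'EUR')
```

Renaming or removing a field breaks exactly this: the `.get` lookup on old events would silently return the wrong thing or a `KeyError` would crash the consumer.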
Where Things Go Wrong
A few failure scenarios I want to keep in mind:
Consumer lag: Consumers fall behind and the lag grows. Usually means consumers are too slow or there's a spike in events. Fix: scale consumers (add more in the group), or optimize processing.
Poison pill: One malformed event that causes a consumer to crash every time it tries to process it. The consumer keeps restarting, re-reading the same event, crashing again. Fix: dead letter queue — after N retries, move the event to a separate topic for manual inspection.
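The retry-then-dead-letter pattern is a small wrapper around the handler. A sketch with a made-up `MAX_RETRIES` and a list standing in for the dead-letter topic:

```python
MAX_RETRIES = 3
dead_letter_topic = []  # stand-in for a separate DLQ topic

def process_with_dlq(event, handler):
    """Retry the handler; after MAX_RETRIES failures, park the event in the DLQ."""
    for attempt in range(MAX_RETRIES):
        try:
            return handler(event)
        except Exception:
            continue  # in real code: log the error and maybe back off
    dead_letter_topic.append(event)  # for manual inspection later
    return None  # the consumer moves on instead of crash-looping

def handler(event):
    if event.get("malformed"):
        raise ValueError("cannot parse event")
    return "ok"

print(process_with_dlq({"eventId": "good"}, handler))                     # ok
print(process_with_dlq({"eventId": "bad", "malformed": True}, handler))   # None
print(dead_letter_topic)  # the poison pill is parked, not blocking the partition
```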
Event ordering issues: If you don't partition by the right key, two events about the same entity can be processed out of order. For example, "order updated" arriving before "order placed" for the same order because they landed in different partitions.
The Tradeoffs, Honestly
What you actually get:
- Services that don't care about each other at runtime — adding a new consumer is zero changes to existing code
- Consumers scale independently based on their own workload
- Kafka keeps the event log, so you can replay history, rebuild state, backfill new consumers
- Natural audit trail
What it costs:
- Eventual consistency everywhere — if you need "did inventory update successfully?" synchronously, this model doesn't fit cleanly
- Operational overhead of running a broker (Kafka clusters are not trivial to manage)
- Debugging a flow requires tracing events across multiple services and logs
- Schema evolution requires discipline — one bad schema change and old consumers break
When I'd use it:
- High-throughput async processing (payments, notifications, analytics pipelines)
- Fan-out — one event, many independent consumers
- When services genuinely have different scaling characteristics
- When audit history and event replay matter
When I wouldn't:
- Simple request/response where you need an immediate answer
- Small apps where the broker becomes more complexity than it solves
- When the team doesn't have the ops capacity to run it reliably
Common Brokers
| Broker | Model | Ordering | Replay | Good For |
|---|---|---|---|---|
| Kafka | Distributed log, pull | Per-partition | Yes, indefinitely | High throughput, audit logs, event replay |
| RabbitMQ | Message queue, push | Per-queue | No (once consumed, gone) | Task queues, flexible routing, lower volume |
| AWS SQS | Managed queue | Best-effort (FIFO queue for strict) | No | Cloud-native, low ops overhead |
| AWS SNS | Pub/sub fanout | No | No | Fan-out to SQS queues or HTTP endpoints |
| Redis Streams | Lightweight log, pull | Per-stream | Yes, within retention | Simple use cases, already running Redis |
The one that comes up most in real backend roles seems to be Kafka, so that's where most of these notes are focused. But the concepts mostly transfer.