[backend][architecture][distributed-systems][events][kafka]

Event-Driven Architecture

> How event-driven systems actually work under the hood — brokers, consumers, delivery guarantees, and the tradeoffs you have to live with.

The Problem First

Before getting into how EDA works, I want to understand why it exists, because the natural instinct when Service A needs to tell Service B that something happened is to just... call it.

```mermaid
sequenceDiagram
    participant OS as Order Service
    participant IS as Inventory Service
    participant NS as Notification Service
    participant AS as Analytics Service
    OS->>IS: POST /inventory/decrement
    IS-->>OS: 200 ok
    OS->>NS: POST /notifications/send
    NS-->>OS: 200 ok
    OS->>AS: POST /analytics/record
    AS-->>OS: 200 ok
```

This works. But notice what's happening: the Order Service now has to know about 3 other services. It has to call them in sequence. If Notification Service is slow, the whole thing is slow. If Analytics is down, the order flow might fail. And every time you add a new service that cares about orders, you have to go modify the Order Service again.

The Order Service has become a coordinator. That's the coupling problem — and it compounds fast.


The Core Shift

Instead of Service A calling Service B, Service A records that something happened. It publishes an event to a shared channel. Any service that cares subscribes to that channel and reacts independently.

```mermaid
flowchart LR
    OS[Order Service] -->|"order.placed"| B[Message Broker]
    B --> IS[Inventory Service]
    B --> NS[Notification Service]
    B --> AS[Analytics Service]
```

The Order Service no longer knows who's listening. It doesn't care. It just says "order placed" and its job is done. The broker handles delivery, buffering, and fan-out.

This is the fundamental inversion: instead of the producer orchestrating consumers, consumers are responsible for reacting to things they care about.


What an Event Actually Is

An event is an immutable record of something that happened. Not a command ("do this"), not a query ("give me this") — a fact ("this happened"). You don't change events after publishing them.

```json
{
  "eventType": "order.placed",
  "eventId": "evt_01HZ9K2M3X",
  "timestamp": "2026-03-10T14:23:00Z",
  "version": "1.0",
  "source": "order-service",
  "data": {
    "orderId": "ord_9921",
    "userId": "usr_441",
    "items": [
      { "productId": "prod_88", "quantity": 2, "price": 29.99 }
    ],
    "totalAmount": 59.99
  }
}
```

A few things worth noting here:

  • eventId is a unique ID for deduplication — if a consumer receives the same event twice, it can detect and ignore the duplicate
  • version matters a lot — events are stored permanently in systems like Kafka, so if you change the shape of the event later, consumers reading old events still need to handle them
  • source tells you which service produced it, useful for debugging
  • The data payload should be self-contained enough that consumers don't need to call back to get more info
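As a concrete sketch, here's how a producer might build this envelope in plain Python. The `make_event` helper and the uuid-based ID scheme are my own illustration, not a standard API:

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, source: str, data: dict, version: str = "1.0") -> dict:
    """Build an event envelope matching the shape above."""
    return {
        "eventType": event_type,
        "eventId": f"evt_{uuid.uuid4().hex[:10]}",  # unique id for consumer-side dedup
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,  # consumers of old events need this to pick a decoder
        "source": source,    # which service produced it, for debugging
        "data": data,        # self-contained payload: consumers shouldn't call back for more
    }

event = make_event("order.placed", "order-service",
                   {"orderId": "ord_9921", "totalAmount": 59.99})
print(json.dumps(event, indent=2))
```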

How the Broker Works Under the Hood

The broker is the backbone: it's what makes the whole thing work, so its internals are worth understanding.

Topics and Partitions (Kafka model)

In Kafka, events are written to topics. A topic is just a named log — an append-only, ordered sequence of events. Topics are split into partitions for parallelism.

```mermaid
flowchart TD
    subgraph "Topic: order.placed"
        P0["Partition 0<br/>offset 0,1,2,3..."]
        P1["Partition 1<br/>offset 0,1,2,3..."]
        P2["Partition 2<br/>offset 0,1,2,3..."]
    end
    Producer -->|write| P0
    Producer -->|write| P1
    Producer -->|write| P2
```

When a producer writes an event, Kafka picks which partition to write it to — usually based on a partition key (e.g., orderId). Events with the same key always go to the same partition. This is how Kafka guarantees ordering: events in the same partition are strictly ordered. Across partitions, there's no ordering guarantee.
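The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner actually hashes with murmur2; the version below uses md5 purely to illustrate the stable hash-mod-N idea:

```python
import hashlib

NUM_PARTITIONS = 3

def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key, modulo the partition count. Same key -> same
    # partition -> per-entity ordering. (Kafka's default uses murmur2.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event for this order lands in the same partition, so they stay ordered.
assert pick_partition("ord_9921") == pick_partition("ord_9921")
```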

Each event in a partition has an offset — just an incrementing integer. Consumers track which offset they've read up to. This is how Kafka knows what to deliver next.

How Consumers Read

Kafka uses a pull model. Consumers poll the broker asking "give me events from partition X starting at offset Y." The broker returns a batch. The consumer processes it, commits the offset, and polls again.

```mermaid
sequenceDiagram
    participant C as Consumer
    participant K as Kafka Broker
    participant DB as Offset Store
    C->>K: fetch(topic, partition=0, offset=42)
    K-->>C: events [42, 43, 44]
    C->>C: process events
    C->>DB: commit offset 44
    C->>K: fetch(topic, partition=0, offset=45)
```

The offset commit is what makes at-least-once delivery work. If the consumer crashes after processing but before committing, it'll re-read and re-process the same events on restart. That's why idempotency in consumers matters.
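To see why commit-after-process implies at-least-once, here's a toy simulation. The `Partition` class is a stand-in I made up, not a real client:

```python
class Partition:
    """A toy partition log plus a committed offset."""
    def __init__(self, events):
        self.events = events
        self.committed = 0  # next offset to read after a restart

    def poll(self):
        return list(enumerate(self.events[self.committed:], start=self.committed))

    def commit(self, offset):
        self.committed = offset + 1

log = Partition(["evt_a", "evt_b", "evt_c"])
processed = []

# First pass: process evt_a and evt_b, but "crash" before committing evt_b.
for offset, event in log.poll():
    processed.append(event)
    if event == "evt_b":
        break          # simulated crash: evt_b's offset never gets committed
    log.commit(offset)

# Restart: evt_b is re-delivered because its offset was never committed.
for offset, event in log.poll():
    processed.append(event)
    log.commit(offset)

print(processed)  # evt_b appears twice, hence the need for idempotency
```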

Consumer Groups

Multiple consumers can read from the same topic by forming a consumer group. Kafka assigns each partition to exactly one consumer in the group. This is how you scale consumers horizontally.

```mermaid
flowchart LR
    subgraph "Topic: order.placed"
        P0[Partition 0]
        P1[Partition 1]
        P2[Partition 2]
    end
    subgraph "Consumer Group: inventory-service"
        C1[Consumer 1]
        C2[Consumer 2]
        C3[Consumer 3]
    end
    P0 --> C1
    P1 --> C2
    P2 --> C3
```

Each partition is consumed by exactly one consumer in the group at a time — so no duplicate processing within a group. But two different consumer groups (e.g., inventory-service and notification-service) both get every event independently.

If you have 3 partitions and 4 consumers in a group, one consumer sits idle. If you have 3 partitions and 2 consumers, one consumer handles 2 partitions. The number of partitions is the ceiling on parallelism.
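The partition math is easy to sketch. This is a simplified round-robin assignor; Kafka's real range/round-robin/sticky assignors differ in detail, but the idle-consumer and doubled-up cases come out the same:

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: one consumer sits idle.
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# 3 partitions, 2 consumers: one consumer handles 2 partitions.
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
```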

Replication

Kafka replicates each partition across multiple brokers for fault tolerance. One broker is the leader for a partition — it handles all reads and writes. Others are followers that replicate the leader's log.

```mermaid
flowchart LR
    Producer -->|write| L["Leader<br/>Broker 1"]
    L -->|replicate| F1["Follower<br/>Broker 2"]
    L -->|replicate| F2["Follower<br/>Broker 3"]
    Consumer -->|read| L
```

If the leader goes down, one of the followers is elected as the new leader. Events that were written to the old leader but not yet replicated may be lost — this is the tradeoff between durability and throughput. Kafka's acks setting controls it: acks=all means the leader waits for all in-sync replicas to confirm before acknowledging the write. Safer, but slower.
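In the Python confluent-kafka client this is a producer config option. A sketch, with the broker address as a placeholder (these option names come from librdkafka's configuration reference):

```python
# Producer config sketch (confluent-kafka / librdkafka option names).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder: your broker address
    "acks": "all",               # leader waits for all in-sync replicas to confirm
    "enable.idempotence": True,  # broker dedupes producer retries on its side
}
```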


Full Request Flow

Here's what actually happens end-to-end when a user places an order in an event-driven system.

```mermaid
sequenceDiagram
    participant U as User
    participant OS as Order Service
    participant DB as Orders DB
    participant K as Kafka
    participant IS as Inventory Service
    participant NS as Notification Service
    U->>OS: POST /orders
    OS->>DB: INSERT order (status: pending)
    OS->>K: produce event "order.placed"
    K-->>OS: ack (offset stored)
    OS-->>U: 201 Created { orderId }
    par Async — happens independently
        K->>IS: deliver "order.placed"
        IS->>IS: decrement stock
        IS->>K: produce "inventory.updated"
    and
        K->>NS: deliver "order.placed"
        NS->>NS: send confirmation email
    end
```

The response to the user happens before the consumers do anything. The order is confirmed as soon as it's saved and the event is published. Everything else is eventual.
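The handler for that flow might look like this sketch, with `db_insert` and `produce` as hypothetical stand-ins for the real database and Kafka clients:

```python
def place_order(order: dict, db_insert, produce) -> dict:
    """Persist, publish, respond. Consumers react later, on their own."""
    order_id = db_insert({**order, "status": "pending"})   # save first
    produce("order.placed", {"orderId": order_id, **order})  # then publish
    return {"status": 201, "orderId": order_id}  # returned before any consumer runs

# Exercise it with fakes: a DB that returns an id, a producer that records events.
events = []
resp = place_order(
    {"userId": "usr_441", "totalAmount": 59.99},
    db_insert=lambda row: "ord_9921",
    produce=lambda topic, payload: events.append((topic, payload)),
)
print(resp, events)
```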


Delivery Guarantees in Detail

This is one of the trickier parts to get right.

| Guarantee | How It Works | Risk |
| --- | --- | --- |
| At-most-once | Commit offset before processing | Can lose messages if the consumer crashes mid-process |
| At-least-once | Commit offset after processing | Can process the same message twice if the consumer crashes after processing but before committing |
| Exactly-once | Transactional producer + consumer | Complex, expensive, not always available |

Most real systems use at-least-once and build idempotent consumers. The logic is: it's better to process a duplicate than to lose an event. So you design your consumers to be safe to run twice — usually by checking "have I already processed this eventId?" before doing any work.

```mermaid
flowchart TD
    A[Receive event] --> B{"Already processed?<br/>Check eventId in DB"}
    B -->|yes| C["Skip — return early"]
    B -->|no| D[Process event]
    D --> E[Mark eventId as processed]
    E --> F[Commit offset]
```
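In code, idempotent handling can be as simple as a dedup check keyed on eventId. An in-memory sketch — in production the "seen" set would be a database table with a unique index, so the check survives restarts:

```python
processed_ids = set()  # stand-in for a durable store keyed by eventId

def handle_event(event: dict) -> bool:
    """Process an event at most once; safe to call with duplicates."""
    if event["eventId"] in processed_ids:
        return False  # duplicate delivery: skip the work, still commit the offset
    # ... the real work goes here (decrement stock, send email, ...)
    processed_ids.add(event["eventId"])
    return True

evt = {"eventId": "evt_01HZ9K2M3X", "eventType": "order.placed"}
assert handle_event(evt) is True    # first delivery: processed
assert handle_event(evt) is False   # redelivery: detected and skipped
```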

Push vs Pull

```mermaid
flowchart LR
    subgraph "Push - RabbitMQ style"
        B1[Broker] -->|pushes events| C1[Consumer]
    end
    subgraph "Pull - Kafka style"
        C2[Consumer] -->|polls for events| B2[Broker]
    end
```

Push: The broker pushes events to consumers as they arrive. Lower latency, but if the consumer is slow, the broker can overwhelm it. You need flow control mechanisms.

Pull: The consumer asks for events when it's ready. Natural backpressure — a slow consumer just falls behind in its offset without crashing. The broker doesn't care. This is why Kafka uses pull.


Event Schema and Versioning

This is something I didn't think about at first, but it's actually one of the harder operational problems.

Since events are stored in Kafka for potentially months or years, and old consumers still need to read old events, you can't just change the shape of an event freely.

```mermaid
flowchart LR
    subgraph "Schema Evolution"
        V1["v1: { orderId, userId, amount }"]
        V2["v2: { orderId, userId, amount, currency }"]
        V3["v3: { orderId, userId, totalAmount, currency, items[] }"]
    end
    V1 -->|add optional field| V2
    V2 -->|rename + add array| V3
```

Safe changes: adding optional fields, adding new event types. Dangerous changes: renaming fields, removing fields, changing field types.

The standard approach is a schema registry (like Confluent Schema Registry for Kafka) that validates every event against a registered schema before it's published. Consumers fetch the schema by the ID embedded in the event, so schema changes are versioned and validated rather than ad hoc.
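On the consumer side, tolerating older event versions usually means defaulting the fields that newer schemas added. A sketch with hypothetical field names matching the v1/v2 shapes above (the "USD" default is an assumption for illustration):

```python
def read_order_event(data: dict) -> dict:
    """Normalize v1 and v2 payloads to one shape by defaulting
    the optional field that v2 added."""
    return {
        "orderId": data["orderId"],
        "userId": data["userId"],
        "amount": data["amount"],
        "currency": data.get("currency", "USD"),  # v1 events lack this field
    }

v1 = {"orderId": "ord_1", "userId": "usr_1", "amount": 10.0}
v2 = {"orderId": "ord_2", "userId": "usr_2", "amount": 20.0, "currency": "EUR"}
assert read_order_event(v1)["currency"] == "USD"
assert read_order_event(v2)["currency"] == "EUR"
```

Renames and removals can't be papered over like this, which is why they're the dangerous changes.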


Where Things Go Wrong

A few failure scenarios I want to keep in mind:

Consumer lag: Consumers fall behind and the lag grows. Usually means consumers are too slow or there's a spike in events. Fix: scale consumers (add more in the group), or optimize processing.

```mermaid
flowchart LR
    P["Producer<br/>10k events/sec"] --> K[Kafka]
    K --> C["Consumer<br/>2k events/sec"]
    style C fill:#ff6b6b
```

Poison pill: One malformed event that causes a consumer to crash every time it tries to process it. The consumer keeps restarting, re-reading the same event, crashing again. Fix: dead letter queue — after N retries, move the event to a separate topic for manual inspection.

```mermaid
flowchart LR
    K[Kafka Topic] --> C[Consumer]
    C -->|fails 3x| DLQ[Dead Letter Queue]
    DLQ --> Human[Manual Review]
```
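A retry-then-dead-letter loop can be sketched like this, with `publish_to_dlq` as a stand-in for producing to a real DLQ topic:

```python
MAX_RETRIES = 3

def consume(event, process, publish_to_dlq) -> str:
    """Try the handler a few times, then park the poison pill in the DLQ."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(event)
            return "processed"
        except Exception:
            continue  # transient failure? try again (real code would back off)
    publish_to_dlq(event)  # give up: move it aside so the partition can advance
    return "dead-lettered"

dlq = []
def always_fails(event):
    raise ValueError("malformed payload")  # the poison pill

result = consume({"eventId": "evt_bad"}, always_fails, dlq.append)
print(result, dlq)
```

The important part is that the consumer commits past the bad event after dead-lettering it, instead of crash-looping on the same offset forever.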

Event ordering issues: If you don't partition by the right key, two events about the same entity can be processed out of order. For example, "order updated" arriving before "order placed" for the same order because they landed in different partitions.


The Tradeoffs, Honestly

What you actually get:

  • Services that don't care about each other at runtime — adding a new consumer is zero changes to existing code
  • Consumers scale independently based on their own workload
  • Kafka keeps the event log, so you can replay history, rebuild state, backfill new consumers
  • Natural audit trail

What it costs:

  • Eventual consistency everywhere — if you need "did inventory update successfully?" synchronously, this model doesn't fit cleanly
  • Operational overhead of running a broker (Kafka clusters are not trivial to manage)
  • Debugging a flow requires tracing events across multiple services and logs
  • Schema evolution requires discipline — one bad schema change and old consumers break

When I'd use it:

  • High-throughput async processing (payments, notifications, analytics pipelines)
  • Fan-out — one event, many independent consumers
  • When services genuinely have different scaling characteristics
  • When audit history and event replay matter

When I wouldn't:

  • Simple request/response where you need an immediate answer
  • Small apps where the broker becomes more complexity than it solves
  • When the team doesn't have the ops capacity to run it reliably

Common Brokers

| Broker | Model | Ordering | Replay | Good For |
| --- | --- | --- | --- | --- |
| Kafka | Distributed log, pull | Per-partition | Yes, within configured retention (can be indefinite) | High throughput, audit logs, event replay |
| RabbitMQ | Message queue, push | Per-queue | No (once consumed, gone) | Task queues, flexible routing, lower volume |
| AWS SQS | Managed queue | Best-effort (FIFO queues for strict) | No | Cloud-native, low ops overhead |
| AWS SNS | Pub/sub fan-out | No | No | Fan-out to SQS queues or HTTP endpoints |
| Redis Streams | Lightweight log, pull | Per-stream | Yes, within retention | Simple use cases, already running Redis |

The one that comes up most in real backend roles seems to be Kafka, so that's where most of these notes are focused. But the concepts mostly transfer.