[backend][system-design][distributed-systems][architecture]

Backend Concepts Worth Actually Understanding

> My personal notes on 10 backend concepts I am working through — what they are, how I understand them, and the things that tripped me up.

This is my running record of backend concepts I keep hearing about and want to actually understand, not just recognize the name of. I'm writing these in my own words as I study them.


1. Event-Driven Architecture

Instead of Service A directly calling Service B, Service A just says "hey, this thing happened" and publishes an event. Then any service that cares about that event can react to it on its own time.

The mental shift for me was: the producer doesn't know or care who's listening. That's what makes it decoupled. So if you add a new service that needs to react to "order placed", you don't need to touch the order service at all — you just subscribe to the event.

The downside I keep reading about is that it's harder to trace what actually happened when something goes wrong. With a direct call you can follow the chain. With events it's more scattered. You also need a message broker like Kafka or RabbitMQ sitting in the middle, which is another thing that can fail.
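To make the decoupling concrete for myself, here's a toy in-process event bus (names are mine, a real system would use a broker like Kafka or RabbitMQ instead of a dict):

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub: publishers never reference subscribers directly."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        for handler in self._subscribers[event_name]:
            handler(payload)

bus = EventBus()
# Adding a new reaction to "order_placed" means adding a subscriber,
# without touching the order service at all.
bus.subscribe("order_placed", lambda e: print(f"email: confirm order {e['id']}"))
bus.subscribe("order_placed", lambda e: print(f"inventory: reserve items for {e['id']}"))

# The producer fires the event and is done; both handlers react on their own.
bus.publish("order_placed", {"id": 42})
```

The point is the last line: the order service calls publish() and has no idea two other services reacted.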


2. Saga Pattern (Choreography vs Orchestration)

This came up when I was reading about how you handle transactions that span multiple services. You can't just do a database transaction across services — that's not a thing. So instead you chain together a series of smaller transactions, and if one fails, you run "compensating" steps to undo what already happened.

There are two ways to coordinate this:

Choreography is where each service just listens for events and decides what to do next. No one is in charge. The flow emerges from all the services reacting to each other. Feels elegant but I can see how it gets hard to follow when the flow gets complex.

Orchestration is where one service (the orchestrator) acts as the conductor — it explicitly tells each service what to do and waits for the response. The full flow lives in one place which makes it easier to read and debug.

The tricky part I keep seeing mentioned is writing the compensating transactions correctly. Not every action can be cleanly undone. "Cancel the charge" is easy. "Unsend the email" is not.
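Here's my sketch of the orchestration flavor: run each local step, remember its compensation, and if a later step fails, undo the completed ones in reverse. The step names are invented.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs. Returns True on success."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # A step failed: undo what already happened, most recent first.
            for undo in reversed(completed):
                undo()
            return False
    return True

def book_shipping():
    raise RuntimeError("no carrier available")   # the third step fails

log = []
ok = run_saga([
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (book_shipping, lambda: None),
])
# ok is False; log is: reserve inventory, charge card, refund card, release inventory
```

Notice the compensations here are easy because they're just list appends. The real difficulty is exactly what the notes above say: writing a correct `undo` for actions that don't cleanly reverse.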


3. CQRS (Command Query Responsibility Segregation)

The idea is to split how you write data from how you read data. Instead of one model that does both, you have a Command side (writes, updates, deletes) and a Query side (reads).

Why would you want this? Because read and write patterns are usually pretty different. Writes care about validation and consistency. Reads care about being fast and often need data shaped differently — like aggregated or joined in ways that don't match the write model.

So in practice, the write side stores data in a normalized way, and the read side maintains its own denormalized views that are optimized for queries. Those views get updated asynchronously, usually through events from the write side.

I want to be careful not to reach for this everywhere though. For most simple CRUD things, it's overkill and just adds complexity for no reason.
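A toy version of the split, to fix the shape in my head: the command side validates and stores normalized rows, emits events, and the query side maintains a denormalized view from those events. All names are made up.

```python
class OrderCommands:
    """Write side: validation plus a normalized store, emitting events."""
    def __init__(self, event_log):
        self.orders = []            # normalized write model
        self.event_log = event_log

    def place_order(self, customer, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")   # write-side concern
        self.orders.append({"customer": customer, "amount": amount})
        self.event_log.append(("order_placed", customer, amount))

class OrderTotalsView:
    """Read side: a denormalized per-customer total, rebuilt from events."""
    def __init__(self):
        self.totals = {}

    def apply(self, event):
        kind, customer, amount = event
        if kind == "order_placed":
            self.totals[customer] = self.totals.get(customer, 0) + amount

events = []
commands = OrderCommands(events)
view = OrderTotalsView()
commands.place_order("ana", 10)
commands.place_order("ana", 5)
for e in events:        # in a real system this update is asynchronous
    view.apply(e)
```

The read side never touches the orders table; it only consumes events, which is also why it can lag behind the write side.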


4. Event Sourcing (and When NOT to Use It)

Normal systems store the current state of something. Event sourcing stores every event that ever happened to it, and you derive the current state by replaying the history.

The cool thing is you get a full audit log built in. You can also replay events to build new views or fix bugs by re-processing history.

But the costs are real and I want to make sure I remember them:

  • If you need to change the shape of an event later, you still need to be able to read the old ones — schema migration is painful
  • Replaying a long event history to get current state is slow, so you need snapshots
  • Querying is weird without a separate read model (this pairs naturally with CQRS)

Honestly from what I've read, most systems don't need this. It's a pattern for specific problems, not a default way to build things. If you just need good logging and audit trails, a regular database with good logging might be enough.
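To make the replay-plus-snapshot idea concrete, here's a toy account balance derived from its event history. The event shapes are invented.

```python
def apply_event(balance, event):
    """Fold one event into the state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    return balance

def current_state(events, snapshot_balance=0, snapshot_offset=0):
    """Start from a snapshot and replay only the events after it."""
    balance = snapshot_balance
    for event in events[snapshot_offset:]:
        balance = apply_event(balance, event)
    return balance

history = [("deposited", 100), ("withdrew", 30), ("deposited", 5)]
full = current_state(history)                                        # 75, full replay
fast = current_state(history, snapshot_balance=70, snapshot_offset=2)  # 75, from snapshot
```

Same answer both ways, but the snapshot version skips most of the history, which is the whole point once the log gets long.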


5. Circuit Breaker Pattern (Plus Retries, Timeouts, Bulkheads)

These four things go together and are all about handling failures gracefully when calling other services.

Circuit Breaker: If a service you're calling keeps failing, at some point you should just stop calling it for a while instead of hammering it with requests. The circuit breaker tracks failure rate and when it crosses a threshold it "opens" — meaning calls fail immediately without even trying. After some time it lets a few requests through to test if the service recovered.

Retries: When a call fails, try again. But you shouldn't retry immediately in a loop — you should wait a bit longer each time (exponential backoff) and add some randomness (jitter) so all your clients don't retry at exactly the same moment and flood the service.

Timeouts: Every network call needs a timeout. If you don't set one and the other service hangs forever, your threads will just sit there waiting and eventually you run out of them.

Bulkheads: Keep separate resource pools for separate dependencies. So if Service B goes slow and fills up its connection pool, Service C still has its own pool and isn't affected. Borrowed from how ships are built with separate watertight compartments.

The insight that stuck with me: these don't work well in isolation. Retries without a circuit breaker just keep making things worse. A circuit breaker without a timeout doesn't help much.


6. Distributed Tracing (Trace IDs, Spans, Baggage, Sampling)

When a request goes through multiple services, how do you track it end to end? That's what distributed tracing solves.

Trace ID: A unique ID generated when the request first enters your system. It gets passed along in HTTP headers through every service. Every log line includes this ID so you can filter and see the full journey of that request.

Spans: Each unit of work within a trace is a span — like a single DB query or a call to another service. Spans have start/end times and a parent span, so the full trace forms a tree. This is how you see where time is being spent.

Baggage: Extra key-value data you attach to the trace context that also gets propagated. Useful for things like tenant ID or feature flags. But it adds overhead to every request so you shouldn't put a lot of stuff in there.

Sampling: You can't trace every single request at scale, it would be too much data. So you sample, tracing only some percentage of requests.

  • Head-based sampling decides at the start of the request. Simple, but you might miss rare bugs.
  • Tail-based sampling waits until the request is done and decides based on the outcome. You can prioritize errors and slow requests, but it's more expensive.

OpenTelemetry is the standard I keep seeing for this.
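Here's a toy version of head-based sampling plus header propagation. The header loosely imitates the W3C `traceparent` format; in practice you'd let OpenTelemetry do this rather than hand-rolling it.

```python
import random
import uuid

def start_trace(sample_rate=0.1):
    """Entry point of the system: mint a trace ID and decide sampling
    up front (head-based)."""
    return {
        "trace_id": uuid.uuid4().hex,            # 32 hex chars
        "sampled": random.random() < sample_rate,
    }

def outgoing_headers(ctx, span_id):
    """Propagate the trace context to the next service in the chain."""
    flags = "01" if ctx["sampled"] else "00"
    return {"traceparent": f"00-{ctx['trace_id']}-{span_id}-{flags}"}
```

Every downstream service parses this header, attaches the same trace ID to its logs and spans, and forwards it again, which is what makes the end-to-end view possible.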


7. CAP Theorem (and What It Means in Real Systems)

CAP says: in a distributed system, when there's a network partition (nodes can't talk to each other), you can only pick one of:

  • Consistency — every read gets the most recent write
  • Availability — every request gets a response (might be stale)

This confused me at first because I thought it meant you always have to choose between consistency and availability. But the key detail is: you're only forced to choose during a partition. When things are running fine, you can have both.

So the real question is: when something breaks and nodes get split, what do you want your system to do? Return an error to keep data correct (CP), or return potentially stale data to stay up (AP)?

CP examples: etcd, Zookeeper — they'll refuse to respond from a minority partition. AP examples: Cassandra, DynamoDB in eventually-consistent mode — they keep responding and resolve conflicts later.

Something I want to remember: CAP doesn't say anything about latency. There's another model called PACELC that also accounts for the latency tradeoff that exists even when there's no partition.


8. Idempotency (Keys, Dedup, Exactly-Once Illusions)

An operation is idempotent if you can run it multiple times and get the same result as running it once. GET requests are naturally idempotent. DELETE is too, in terms of state: deleting the same record twice leaves the system in the same place, even if the second call returns a 404. POST to create a record is not: run it twice and you get duplicates.

Why this matters: networks fail, clients retry, message queues deliver things more than once. If your operations aren't idempotent, retries cause real problems — duplicate payments, duplicate emails, double-created records.

Idempotency keys: The client generates a unique ID for each logical operation (like a payment). It sends this key with the request. The server stores the key and the result. If the same key comes in again, the server returns the stored result instead of running the operation again. Stripe does this for payments.

The thing I want to remember: "exactly-once delivery" is basically impossible in distributed systems. What you actually get is at-least-once delivery combined with deduplication on the consumer side, which gives you effectively-once behavior. It's not truly exactly-once — it's just idempotent handling of duplicates.
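The server side of the idempotency-key idea, as I understand it, fits in a few lines. This toy version keeps results in a dict; a real store would need TTLs, persistence, and an atomic check-and-set so two concurrent retries don't both run the operation.

```python
class IdempotentHandler:
    """Wraps an operation so replays with the same key return the
    stored result instead of re-running the side effect."""
    def __init__(self, operation):
        self.operation = operation
        self.results = {}            # idempotency_key -> stored result

    def handle(self, key, request):
        if key in self.results:
            return self.results[key]   # duplicate: replay the stored result
        result = self.operation(request)
        self.results[key] = result
        return result

charges = []

def charge_card(amount):
    charges.append(amount)           # the side effect we must not duplicate
    return {"status": "charged", "amount": amount}

handler = IdempotentHandler(charge_card)
first = handler.handle("key-abc", 100)
retry = handler.handle("key-abc", 100)   # client retried with the same key
# first == retry, and charge_card ran exactly once
```

This is the "dedup on the consumer side" half of effectively-once: the duplicate still arrives, it just doesn't do anything a second time.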


9. Data Sharding (Routing, Rebalancing, Hot Partitions)

Sharding is splitting your data across multiple database nodes so no single node has to hold everything or handle all the traffic.

How do you decide which shard a piece of data goes to?

  • Hash-based: run the key through a hash function, use the result to pick a shard. Even distribution but adding/removing shards is painful because most keys need to remap (consistent hashing helps with this).
  • Range-based: each shard owns a range of keys (like A-F, G-M). Good for range queries but sequential writes all pile up on one shard.
  • Directory-based: a separate service maps keys to shards. Flexible but now you have another thing that can fail.
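Hash-based routing is short enough to write out. One detail I learned: use a stable hash like SHA-256, not Python's built-in `hash()`, which varies between processes.

```python
import hashlib

def shard_for(key, num_shards):
    """Stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The remapping pain is visible right in the formula: change `num_shards` and the modulo result moves for most keys, which is exactly the problem consistent hashing is designed to shrink.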

Rebalancing is what happens when you add a new shard — you have to move data around. The goal is to do this without downtime.

Hot partitions are when one shard gets way more traffic than others — like if you shard by user ID and a celebrity has millions of requests. One solution I've seen mentioned is adding a random suffix to the key to spread it across multiple shards, then merging the results on read.
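The suffix trick, sketched out (the fanout number and key format are my own invention): writes scatter across sub-keys, and reads have to fan out to all of them and merge.

```python
import random

def write_key(hot_key, fanout=8):
    """Scatter writes for a hot key across `fanout` sub-keys."""
    return f"{hot_key}#{random.randrange(fanout)}"

def read_keys(hot_key, fanout=8):
    """Reads must fan out to every sub-key and merge the results."""
    return [f"{hot_key}#{i}" for i in range(fanout)]
```

The tradeoff is right there: write load spreads out, but every read now costs `fanout` lookups plus a merge, so you only do this for keys that are actually hot.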


10. API Gateway (Auth, Rate Limits, Routing, Policy, Observability)

An API gateway sits in front of all your services and handles the stuff that every service would otherwise have to implement itself.

Auth: Validate JWTs or OAuth tokens once at the gateway. The individual services can just trust the forwarded headers instead of each having their own auth logic.

Rate limiting: Control how many requests a client can make. Stops abuse and protects services from getting flooded. Token bucket and sliding window are the common algorithms. Better to stop traffic here than let it reach your services.

Routing: The gateway maps paths to services. This is also how you do versioning (/v1/ vs /v2/) or canary releases where you send a small percentage of traffic to a new version.

Policy stuff: TLS termination, CORS headers, IP blocking, response caching — centralize all of this so you don't have 10 services each doing it slightly differently.

Observability: Because all traffic flows through the gateway, it's the natural place to collect metrics on request rates, error rates, and latency for every service at once. You can also inject trace IDs here so every request is traceable from the start.