June 20, 2026 • 4 min read

systems, architecture

Understanding Distributed Systems: A Practical Primer

A deep dive into the core concepts of distributed systems — consistency, partitioning, and the trade-offs you need to know.

Key takeaways

→Network partitions are inevitable — design for them from day one
→Strong consistency is expensive; eventual consistency is often good enough
→Start simple with a monolith, then distribute only where it helps

Why Distributed Systems Matter

Modern applications rarely run on a single machine. From social media feeds to financial trading platforms, distributed systems are the backbone of the internet. But they come with unique challenges.

[Figure: three servers connected in a cluster]

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

This quote captures the essence: distribution adds power, but also complexity. The question isn't whether you'll encounter failures — it's how gracefully you'll handle them.

A common benchmark for system reliability is uptime, usually measured in nines — a system with 99.999% availability is said to have five nines. Achieving this requires careful design of every layer in the stack.
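
To make the "nines" concrete, here is a small sketch that converts a number of nines into a yearly downtime budget. The function names and the 365.25-day year are assumptions made for illustration, not part of any standard.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def availability_for_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Return the downtime budget, in minutes per year, for a number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # The unavailable fraction is 10^-nines, e.g. 0.00001 for five nines.
    return 10 ** (-nines) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, while three nines leaves close to 9 hours.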

For a deeper look at deployment strategies, check out our guide on Zero-downtime deployments.

The Fallacies of Distributed Computing

Before diving in, let's acknowledge the classic fallacies that trip up engineers:

#	Fallacy	Reality
1	The network is reliable	Packets drop, latency spikes, connections time out
2	Latency is zero	Every round trip costs time, from well under 1 ms inside a datacenter to 100 ms or more across continents
3	Bandwidth is infinite	Throttling and congestion are real
4	The network is secure	Assume every packet can be read or forged
5	Topology doesn't change	Servers come and go, networks are repartitioned
6	There is one administrator	Config drift across teams and regions
7	Transport cost is zero	Serialization and I/O add overhead
8	The network is homogeneous	Different OS, kernel versions, and hardware everywhere

System Architecture Overview

A typical distributed system follows a layered architecture:

plaintext

┌─────────────────────────────────────┐
│        Load Balancer (LB)           │
├─────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐         │
│  │  App A   │  │  App B   │  ...    │
│  └────┬─────┘  └────┬─────┘         │
│       │              │              │
│  ┌────▼──────────────▼─────┐        │
│  │     Cache (Redis)       │        │
│  └────┬────────────────────┘        │
│       │                             │
│  ┌────▼────────────────────┐        │
│  │   Database (Primary)    │        │
│  └────┬────────────────────┘        │
│       │                             │
│  ┌────▼────────────────────┐        │
│  │   Database (Replica)    │        │
│  └─────────────────────────┘        │
└─────────────────────────────────────┘

Consistency Models

Strong Consistency

Every read returns the most recent write. Simple to reason about, but expensive to achieve across nodes. Used in financial systems and user authentication.

sql

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
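
The same transfer can be tried locally with Python's built-in `sqlite3` module. This is a single-node sketch, so it only illustrates the all-or-nothing behavior of the transaction, not consistency across nodes. The `accounts` table mirrors the SQL above; the helper names and the insufficient-funds check are assumptions added for the demo.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two demo accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 150), (2, 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Move `amount` from `src` to `dst`; either both updates apply or neither."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


def balances(conn: sqlite3.Connection) -> dict:
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A failed transfer raises `ValueError` and leaves both balances exactly as they were, because the debit is rolled back together with the rest of the transaction.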

Eventual Consistency

Given enough time without updates, all replicas converge. This is the model behind DNS and many NoSQL databases.

python

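# Simplified read with last-write-wins reconciliation.
# A true Dynamo-style quorum read waits for only R of N replicas and
# usually compares vector clocks rather than wall-clock timestamps.
def get(key):
    # read_from_all_replicas is a placeholder for the storage client
    values = read_from_all_replicas(key)
    # Return the most recent version by timestamp
    return max(values, key=lambda v: v.timestamp)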
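
For comparison, here is a sketch of a quorum read and write over in-memory replicas. It assumes N replicas with R + W > N, so every read set overlaps every successful write set. `Replica`, `Versioned`, and `QuorumStore` are hypothetical names, and timestamps stand in for the version vectors a real store would likely use.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # larger means newer


class Replica:
    """In-memory stand-in for one storage node."""

    def __init__(self) -> None:
        self.data: Dict[str, Versioned] = {}
        self.up = True

    def read(self, key: str) -> Optional[Versioned]:
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self.data.get(key)

    def write(self, key: str, item: Versioned) -> None:
        if not self.up:
            raise ConnectionError("replica unreachable")
        current = self.data.get(key)
        # Keep only the newest version (last write wins).
        if current is None or item.timestamp > current.timestamp:
            self.data[key] = item


class QuorumStore:
    def __init__(self, replicas: List[Replica], r: int, w: int) -> None:
        if r + w <= len(replicas):
            raise ValueError("need R + W > N so read and write sets overlap")
        self.replicas, self.r, self.w = replicas, r, w

    def put(self, key: str, item: Versioned) -> None:
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, item)
                acks += 1
            except ConnectionError:
                continue
        if acks < self.w:
            raise RuntimeError(f"write quorum not met: {acks}/{self.w}")

    def get(self, key: str) -> Optional[Versioned]:
        responses: List[Optional[Versioned]] = []
        for replica in self.replicas:
            try:
                responses.append(replica.read(key))
            except ConnectionError:
                continue
            if len(responses) >= self.r:
                break  # stop as soon as R replicas have answered
        if len(responses) < self.r:
            raise RuntimeError(f"read quorum not met: {len(responses)}/{self.r}")
        found = [v for v in responses if v is not None]
        return max(found, key=lambda v: v.timestamp) if found else None
```

With N = 3 and R = W = 2, a read still returns the newest value even when one replica missed the latest write, because at least one of the two replicas it consults took part in that write. The sketch does not undo a write that fell short of its quorum; a production store needs a strategy for that case.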

Comparison Table

Model	Latency	Stale Reads	Write Availability	Use Case
Strong	High	No	Low	Banking, auth
Eventual	Low	Yes	High	DNS, social feeds
Causal	Medium	Rare	Medium	Collaborative editing
Read-Your-Writes	Medium	Session-only	Medium	User dashboards

The CAP Theorem

The CAP theorem says a distributed data store can guarantee at most two of three properties at once:

Consistency — every read sees the last write
Availability — every request gets a response
Partition tolerance — the system works despite network splits

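Key insight: network partitions will happen, so partition tolerance is not optional. The real choice is what to do while a partition lasts: stay consistent and refuse some requests (CP), or stay available and risk stale reads (AP).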
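
The CP versus AP choice can be shown with a toy replica that knows how many peers it can reach. This is a deliberately simplified model with hypothetical names (`PartitionedReplica`, `reachable_peers`); it does no replication at all and only shows how each mode reacts once a majority is out of reach.

```python
from typing import Dict, Optional


class PartitionedReplica:
    """One replica that tracks how many peers it can currently reach."""

    def __init__(self, cluster_size: int, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.cluster_size = cluster_size
        self.mode = mode
        self.reachable_peers = cluster_size - 1  # no partition at start
        self.local: Dict[str, str] = {}

    def has_majority(self) -> bool:
        # Count this node plus the peers it can reach.
        return self.reachable_peers + 1 > self.cluster_size // 2

    def _check_available(self) -> None:
        # A CP node refuses to serve when it is on the minority side.
        if self.mode == "CP" and not self.has_majority():
            raise RuntimeError("unavailable: cannot reach a majority")

    def write(self, key: str, value: str) -> None:
        self._check_available()
        self.local[key] = value

    def read(self, key: str) -> Optional[str]:
        self._check_available()
        # An AP node answers from local state, which may be stale.
        return self.local.get(key)
```

In a five-node cluster where a node can reach only one peer, the CP replica raises an error on every request, while the AP replica keeps answering from whatever it last stored.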

plaintext

            Consistency
              /     \
            CA       CP
            /         \
  Availability ─ AP ─ Partition Tolerance

Real-World Patterns

Leader Election with Raft

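go

// RaftNode holds per-node Raft state. NodeState and LogEntry are assumed to be defined elsewhere.
type RaftNode struct {
    mu        sync.Mutex // guards the fields below (requires import "sync")
    state     NodeState  // Follower, Candidate, Leader
    term      uint64     // latest term this node has seen
    votedFor  string     // candidate voted for in the current term; "" if none
    log       []LogEntry // replicated log entries
    commitIdx uint64     // index of the highest entry known to be committed
}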

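Raft ensures that if a leader crashes, the remaining nodes elect a new one once an election timeout expires (typically within a few seconds at most, depending on configuration), as long as a majority of nodes can still reach each other. Committed log entries are preserved across the change of leader.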
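
The log stays consistent largely because of the rules a node applies before granting its vote. Below is a sketch of those RequestVote rules as described in the Raft paper: reject stale terms, vote at most once per term, and vote only for candidates whose log is at least as up to date. The class and field names are illustrative, and persistence, timers, and networking are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int


@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_index: int = 0
    last_log_term: int = 0


def handle_request_vote(node: NodeState, req: VoteRequest) -> bool:
    """Apply the RequestVote rules and return whether the vote is granted."""
    # Reject candidates from an older term.
    if req.term < node.current_term:
        return False
    # A newer term clears our vote; a leader or candidate would also step down.
    if req.term > node.current_term:
        node.current_term = req.term
        node.voted_for = None
    # At most one vote per term.
    if node.voted_for not in (None, req.candidate_id):
        return False
    # The candidate's log must be at least as up to date as ours:
    # compare the last entry's term first, then the log length.
    candidate_log = (req.last_log_term, req.last_log_index)
    our_log = (node.last_log_term, node.last_log_index)
    if candidate_log < our_log:
        return False
    node.voted_for = req.candidate_id
    return True
```

Because a candidate needs votes from a majority, and every committed entry is stored on a majority, at least one voter holds each committed entry and will refuse a candidate that lacks it.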

Circuit Breaker Pattern

typescript

class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed"
  private failureCount = 0
  private readonly threshold = 5
 
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      throw new Error("circuit breaker open")
    }
    try {
      const result = await fn()
      this.reset()
      return result
    } catch (err) {
      this.failureCount++
      if (this.failureCount >= this.threshold) {
        this.state = "open"
        setTimeout(() => { this.state = "half-open" }, 30_000)
      }
      throw err
    }
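  }

  // Close the circuit and clear the failure count after a successful call.
  private reset(): void {
    this.failureCount = 0
    this.state = "closed"
  }
}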

Putting It Into Practice

Start simple — build a single-node system first, then add replication, then sharding. Each step introduces new challenges, but also new capabilities.

Here's a Kubernetes Deployment manifest for a typical service, configured for rolling updates:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"

Pro tip: Run at least 3 replicas in production, especially for anything that depends on a majority quorum such as leader election or a consensus-backed store. Two isn't enough: a single failure leaves no majority.
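
The quorum arithmetic behind this tip is short enough to write out. A minimal sketch, assuming a simple majority quorum; the function names are made up for illustration.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return n - majority(n)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nodes: majority={majority(n)}, tolerates {tolerated_failures(n)}")
```

Two nodes tolerate zero failures, three tolerate one, and four still tolerate only one, which is why quorum-based clusters are usually sized at odd numbers such as 3 or 5.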

Summary

Distributed systems are complex but manageable when you understand the fundamental trade-offs. Start with the consistency model that fits your use case, plan for partitions, and gradually add sophistication as your scale demands it.

Next up: Zero-downtime deployments — blue-green, canary, and rolling update strategies.
