Understanding Distributed Systems: A Practical Primer
A deep dive into the core concepts of distributed systems — consistency, partitioning, and the trade-offs you need to know.
Key takeaways
- →Network partitions are inevitable — design for them from day one
- →Strong consistency is expensive; eventual consistency is often good enough
- →Start simple with a monolith, then distribute only where it helps
Why Distributed Systems Matter
Modern applications rarely run on a single machine. From social media feeds to financial trading platforms, distributed systems are the backbone of the internet. But they come with unique challenges.
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport
This quote captures the essence: distribution adds power, but also complexity. The question isn't whether you'll encounter failures — it's how gracefully you'll handle them.
A common benchmark for system reliability is uptime, usually measured in nines — a system with 99.999% availability is said to have five nines. Achieving this requires careful design of every layer in the stack.
For a deeper look at deployment strategies, check out our guide on Zero-downtime deployments.
The Fallacies of Distributed Computing
Before diving in, let's acknowledge the classic fallacies that trip up engineers:
| # | Fallacy | Reality |
|---|---|---|
| 1 | The network is reliable | Packets drop, latency spikes, connections time out |
| 2 | Latency is zero | Every network hop costs 1–100ms |
| 3 | Bandwidth is infinite | Throttling and congestion are real |
| 4 | The network is secure | Assume every packet can be read or forged |
| 5 | Topology doesn't change | Servers come and go, networks are repartitioned |
| 6 | There is one administrator | Config drift across teams and regions |
| 7 | Transport cost is zero | Serialization and I/O add overhead |
| 8 | The network is homogeneous | Different OS, kernel versions, and hardware everywhere |
System Architecture Overview
A typical distributed system follows a layered architecture:
┌─────────────────────────────────────┐
│ Load Balancer (LB) │
├─────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ │
│ │ App A │ │ App B │ ... │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ ┌────▼──────────────▼─────┐ │
│ │ Cache (Redis) │ │
│ └────┬────────────────────┘ │
│ │ │
│ ┌────▼────────────────────┐ │
│ │ Database (Primary) │ │
│ └────┬────────────────────┘ │
│ │ │
│ ┌────▼────────────────────┐ │
│ │ Database (Replica) │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────┘Consistency Models
Strong Consistency
Every read returns the most recent write. Simple to reason about, but expensive to achieve across nodes. Used in financial systems and user authentication.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;Eventual Consistency
Given enough time without updates, all replicas converge. This is the model behind DNS and many NoSQL databases.
# Dynamo-style quorum read
def get(key):
values = read_from_all_replicas(key)
# Return the most recent version by timestamp
return max(values, key=lambda v: v.timestamp)Comparison Table
| Model | Latency | Stale Reads | Write Availability | Use Case |
|---|---|---|---|---|
| Strong | High | No | Low | Banking, auth |
| Eventual | Low | Yes | High | DNS, social feeds |
| Causal | Medium | Rare | Medium | Collaborative editing |
| Read-Your-Writes | Medium | Session-only | Medium | User dashboards |
The CAP Theorem
You can have at most two of three properties:
- Consistency — every read sees the last write
- Availability — every request gets a response
- Partition tolerance — the system works despite network splits
Key insight: Network partitions will happen, so you must choose between CP and AP.
Consistency
│
│
CP ───┼─── CA
│
│
AP ───┼─── Partition Tolerance
│
AvailabilityReal-World Patterns
Leader Election with Raft
type RaftNode struct {
mu sync.Mutex
state NodeState // Follower, Candidate, Leader
term uint64
votedFor string
log []LogEntry
commitIdx uint64
}Raft ensures that even if a leader crashes, a new one is elected within seconds, and the log stays consistent.
Circuit Breaker Pattern
class CircuitBreaker {
private state: "closed" | "open" | "half-open" = "closed"
private failureCount = 0
private readonly threshold = 5
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === "open") {
throw new Error("circuit breaker open")
}
try {
const result = await fn()
this.reset()
return result
} catch (err) {
this.failureCount++
if (this.failureCount >= this.threshold) {
this.state = "open"
setTimeout(() => { this.state = "half-open" }, 30_000)
}
throw err
}
}
}Putting It Into Practice
Start simple — build a single-node system first, then add replication, then sharding. Each step introduces new challenges, but also new capabilities.
Here's a deployment flow for a typical service:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
app: api
template:
spec:
containers:
- name: api
image: api:latest
ports:
- containerPort: 8080
resources:
requests:
memory: "256Mi"
cpu: "250m"Pro tip: Always run with at least 3 replicas in production. Two isn't enough — you lose quorum on a single failure.
Summary
Distributed systems are complex but manageable when you understand the fundamental trade-offs. Start with the consistency model that fits your use case, plan for partitions, and gradually add sophistication as your scale demands it.
Next up: Zero-downtime deployments — blue-green, canary, and rolling update strategies.
stay in the loop
get notified when new articles drop. no spam, ever.
Prefer RSS? Subscribe via RSS →