SANATSU.BLOG
June 20, 2026 · 4 min read
systems, architecture

Understanding Distributed Systems: A Practical Primer

A deep dive into the core concepts of distributed systems — consistency, partitioning, and the trade-offs you need to know.

Key takeaways

  • Network partitions are inevitable — design for them from day one
  • Strong consistency is expensive; eventual consistency is often good enough
  • Start simple with a monolith, then distribute only where it helps

Why Distributed Systems Matter

Modern applications rarely run on a single machine. From social media feeds to financial trading platforms, distributed systems are the backbone of the internet. But they come with unique challenges.

[Figure: three servers connected in a cluster]

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

This quote captures the essence: distribution adds power, but also complexity. The question isn't whether you'll encounter failures — it's how gracefully you'll handle them.

A common benchmark for system reliability is uptime, usually measured in nines — a system with 99.999% availability is said to have five nines. Achieving this requires careful design of every layer in the stack.
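To make the nines concrete, here is a minimal sketch that turns an availability percentage into a yearly downtime budget. The function name `downtime_per_year` and the 365-day year are assumptions chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_per_year(availability_percent: float) -> float:
    """Return the allowed downtime in seconds per year for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_per_year(nines) / 60
        print(f"{nines}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of roughly 5.3 minutes per year, which is why a single slow manual failover can use up the whole allowance.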

For a deeper look at deployment strategies, check out our guide on Zero-downtime deployments.

The Fallacies of Distributed Computing

Before diving in, let's acknowledge the classic fallacies that trip up engineers:

| # | Fallacy | Reality |
|---|---------|---------|
| 1 | The network is reliable | Packets drop, latency spikes, connections time out |
| 2 | Latency is zero | Every network hop costs 1–100 ms |
| 3 | Bandwidth is infinite | Throttling and congestion are real |
| 4 | The network is secure | Assume every packet can be read or forged |
| 5 | Topology doesn't change | Servers come and go, networks are repartitioned |
| 6 | There is one administrator | Config drift across teams and regions |
| 7 | Transport cost is zero | Serialization and I/O add overhead |
| 8 | The network is homogeneous | Different OS, kernel versions, and hardware everywhere |
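Fallacy 1 is the one you most often end up handling directly in code. One common response is to retry failed calls with exponential backoff and jitter. The sketch below is an illustration only: `call_with_retries`, its parameters, and the choice of exceptions to retry on are assumptions, not part of any particular library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError or TimeoutError with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2**attempt))
    raise AssertionError("unreachable")
```

Retries like this are only safe when the operation is idempotent; repeating a non-idempotent write after a timeout can apply it twice.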

System Architecture Overview

A typical distributed system follows a layered architecture:

plaintext
┌─────────────────────────────────────┐
│        Load Balancer (LB)           │
├─────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐         │
│  │  App A   │  │  App B   │  ...    │
│  └────┬─────┘  └────┬─────┘         │
│       │              │              │
│  ┌────▼──────────────▼─────┐        │
│  │     Cache (Redis)       │        │
│  └────┬────────────────────┘        │
│       │                             │
│  ┌────▼────────────────────┐        │
│  │   Database (Primary)    │        │
│  └────┬────────────────────┘        │
│       │                             │
│  ┌────▼────────────────────┐        │
│  │   Database (Replica)    │        │
│  └─────────────────────────┘        │
└─────────────────────────────────────┘

Consistency Models

Strong Consistency

Every read returns the most recent write. Simple to reason about, but expensive to achieve across nodes. Used in financial systems and user authentication.

sql
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Eventual Consistency

Given enough time without updates, all replicas converge. This is the model behind DNS and many NoSQL databases.

python
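# Dynamo-style read: query the replicas and keep the newest version.
# read_from_all_replicas is a placeholder for your storage client.
def get(key):
    values = read_from_all_replicas(key)
    if not values:
        raise KeyError(key)  # no replica returned a value
    # Last-write-wins: return the most recent version by timestamp
    return max(values, key=lambda v: v.timestamp)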
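The read above contacts every replica. A quorum variant contacts only R of N replicas and relies on R + W > N so that every read overlaps with the most recent write. The following is a minimal in-memory sketch of that idea; `Versioned`, `QuorumStore`, and the counter-based version numbers are assumptions made for illustration, and a real store would also need repair, conflict handling, and failure detection.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Versioned:
    value: str
    version: int  # stand-in for a timestamp or vector clock


class QuorumStore:
    """In-memory model of N replicas with write quorum W and read quorum R."""

    def __init__(self, n: int, w: int, r: int) -> None:
        if r + w <= n:
            raise ValueError("R + W must exceed N for reads to see the latest write")
        self.n, self.w, self.r = n, w, r
        self.replicas: List[Dict[str, Versioned]] = [{} for _ in range(n)]
        self._next_version = 0

    def put(self, key: str, value: str) -> None:
        self._next_version += 1
        record = Versioned(value, self._next_version)
        # Only W replicas acknowledge; the rest stay stale until repaired.
        for idx in random.sample(range(self.n), self.w):
            self.replicas[idx][key] = record

    def get(self, key: str) -> Optional[str]:
        # Any R replicas overlap with the last W writers because R + W > N.
        answers = [
            self.replicas[idx][key]
            for idx in random.sample(range(self.n), self.r)
            if key in self.replicas[idx]
        ]
        if not answers:
            return None
        return max(answers, key=lambda v: v.version).value
```

With N = 5, W = 3, and R = 3, any three replicas you read from must include at least one that took part in the latest write, so the newest version always wins.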

Comparison Table

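| Model | Latency | Stale Reads | Write Availability | Use Case |
|-------|---------|-------------|--------------------|----------|
| Strong | High | No | Low | Banking, auth |
| Eventual | Low | Yes | High | DNS, social feeds |
| Causal | Medium | Rare | Medium | Collaborative editing |
| Read-Your-Writes | Medium | Session-only | Medium | User dashboards |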

The CAP Theorem

You can have at most two of three properties:

  • Consistency — every read sees the last write
  • Availability — every request gets a response
  • Partition tolerance — the system works despite network splits

Key insight: Network partitions will happen, so partition tolerance isn't optional. When a partition occurs, you must choose between consistency (CP) and availability (AP).
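One way to picture the CP versus AP choice is to look at what a single node does with a write when it can no longer reach a majority of the cluster. This is a simplified sketch: the `Mode` enum, the `handle_write` function, and the three return values are hypothetical names used only to illustrate the trade-off.

```python
from enum import Enum


class Mode(Enum):
    CP = "consistency over availability"
    AP = "availability over consistency"


def handle_write(mode: Mode, reachable_nodes: int, cluster_size: int) -> str:
    """Decide what a node does with a write, given how many nodes it can reach.

    reachable_nodes counts this node itself.
    """
    majority = cluster_size // 2 + 1
    if reachable_nodes >= majority:
        return "accept"  # this node is on the majority side, so it is safe to proceed
    if mode is Mode.CP:
        return "reject"  # stay consistent, give up availability
    return "accept-locally"  # stay available, reconcile after the partition heals
```

A CP system on the minority side returns errors until the partition heals, while an AP system keeps answering and has to merge divergent writes afterwards.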

plaintext
                Consistency
                 /       \
               CA         CP
               /           \
  Availability ──── AP ──── Partition Tolerance

Real-World Patterns

Leader Election with Raft

go
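// RaftNode holds the per-node state used for elections and log replication.
// NodeState and LogEntry are defined elsewhere; sync is the standard library package.
type RaftNode struct {
    mu        sync.Mutex // guards the fields below
    state     NodeState  // Follower, Candidate, or Leader
    term      uint64     // latest term this node has seen
    votedFor  string     // candidate voted for in the current term ("" if none)
    log       []LogEntry // replicated log entries
    commitIdx uint64     // highest log index known to be committed
}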

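Raft ensures that if a leader crashes, the remaining nodes elect a new one once their election timeouts expire, as long as a majority can still communicate. Depending on configuration, that typically takes from a few hundred milliseconds to a few seconds, and committed log entries stay consistent across the cluster.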
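The log stays consistent largely because of the rule a node applies before granting its vote: it rejects stale terms, votes at most once per term, and only votes for candidates whose log is at least as up to date as its own. Here is a minimal sketch of that rule; the `VoteRequest` and `Follower` types and their field names are assumptions for illustration, and timers, persistence, and RPC plumbing are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int


@dataclass
class Follower:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_index: int = 0
    last_log_term: int = 0


def handle_vote_request(node: Follower, req: VoteRequest) -> bool:
    """Return True if this node grants its vote to the candidate."""
    if req.term < node.current_term:
        return False  # stale candidate
    if req.term > node.current_term:
        # A newer term starts with no vote cast.
        node.current_term = req.term
        node.voted_for = None
    # The candidate's log must be at least as up to date as ours:
    # compare the last entry's term first, then the log length.
    log_ok = (req.last_log_term, req.last_log_index) >= (
        node.last_log_term,
        node.last_log_index,
    )
    if log_ok and node.voted_for in (None, req.candidate_id):
        node.voted_for = req.candidate_id
        return True
    return False
```

Because a candidate needs votes from a majority, and every committed entry lives on a majority, at least one voter will refuse any candidate that is missing committed entries.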

Circuit Breaker Pattern

typescript
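class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed"
  private failureCount = 0
  private readonly threshold = 5
  private readonly resetTimeoutMs = 30_000

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While open, fail fast instead of hitting the struggling dependency
    if (this.state === "open") {
      throw new Error("circuit breaker open")
    }
    try {
      const result = await fn()
      this.reset()
      return result
    } catch (err) {
      this.failureCount++
      // A failed half-open probe or too many failures (re)opens the circuit
      if (this.state === "half-open" || this.failureCount >= this.threshold) {
        this.trip()
      }
      throw err
    }
  }

  // Close the circuit and clear the failure count after a success
  private reset(): void {
    this.failureCount = 0
    this.state = "closed"
  }

  // Open the circuit, then allow a trial call once the timeout elapses
  private trip(): void {
    this.state = "open"
    setTimeout(() => { this.state = "half-open" }, this.resetTimeoutMs)
  }
}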

Putting It Into Practice

Start simple — build a single-node system first, then add replication, then sharding. Each step introduces new challenges, but also new capabilities.

Here's a Kubernetes Deployment manifest for a typical service, using a rolling update:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: api:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"

Pro tip: Always run with at least 3 replicas in production. In a quorum-based cluster two isn't enough: a majority of two is two, so a single failure costs you quorum.
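The arithmetic behind that tip is short enough to write down. The sketch below assumes majority quorums; the helper names `majority` and `tolerated_failures` are made up for this example.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority quorum."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nodes: quorum={majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Two nodes tolerate zero failures and four tolerate only one, the same as three, which is why quorum-based clusters are usually sized at odd numbers.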

Summary

Distributed systems are complex but manageable when you understand the fundamental trade-offs. Start with the consistency model that fits your use case, plan for partitions, and gradually add sophistication as your scale demands it.

Next up: Zero-downtime deployments — blue-green, canary, and rolling update strategies.
