Designing for Failure
- system-design
- backend
- distributed-systems
- reliability
- architecture
The Story: The City That Never Sleeps
A great city doesn’t shut down when a water main breaks: traffic reroutes, backup generators kick in, and emergency services redirect. The city has redundancy, fallback plans, and graceful degradation baked into its infrastructure.
Your system must be the same. Not “will things fail?” — they will. The question is: how does your system behave when they do?
The Reliability Vocabulary
Availability
The percentage of time your system is operational and serving requests.
Availability = uptime / (uptime + downtime) × 100%
| SLA | Downtime per year | Downtime per month |
|---|---|---|
| 99% (“two nines”) | 3.65 days | 7.3 hours |
| 99.9% (“three nines”) | 8.77 hours | 43.8 minutes |
| 99.99% (“four nines”) | 52.6 minutes | 4.4 minutes |
| 99.999% (“five nines”) | 5.26 minutes | 26 seconds |
Memory anchor: Five nines = you can only afford 26 seconds of downtime per month. That’s the standard for critical infrastructure (banking, telecom, air traffic control).
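To sanity-check the table, you can compute the downtime budget directly from an availability target; here is a minimal sketch of that arithmetic.

# Downtime budget implied by an availability target (illustrative sketch).
def downtime_budget_minutes(availability_pct, period_hours):
    allowed_fraction = 1 - availability_pct / 100
    return period_hours * 60 * allowed_fraction

print(downtime_budget_minutes(99.99, 30 * 24))    # ≈ 4.3 minutes per month
print(downtime_budget_minutes(99.999, 365 * 24))  # ≈ 5.3 minutes per year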
Systems in series vs parallel
Series (components must ALL work):
[A] → [B] → [C]
Availability = A × B × C
= 0.99 × 0.99 × 0.99
= 0.970 (97%) ← WORSE than each individual component
Parallel (one must work):
        ┌→ [A] ─┐
[in] ───┼→ [B] ─┼──→ [out]
        └→ [C] ─┘
Availability = 1 - (1-A)(1-B)(1-C)
= 1 - (0.01)(0.01)(0.01)
= 0.999999 (99.9999%) ← MUCH BETTER
Key insight: Adding redundancy in parallel dramatically increases availability. This is the fundamental reason for replication, clustering, and multi-AZ deployments.
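These formulas are easy to check in code; the sketch below just evaluates the two expressions above for three components that are each 99% available.

# Combined availability: series vs parallel (illustrative sketch).
def series(*availabilities):
    # All components must work: multiply availabilities.
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    # At least one must work: 1 minus the product of failure probabilities.
    failure = 1.0
    for a in availabilities:
        failure *= 1 - a
    return 1 - failure

print(series(0.99, 0.99, 0.99))    # ≈ 0.9703: weaker than any single component
print(parallel(0.99, 0.99, 0.99))  # ≈ 0.999999: far better than any single component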
MTTR and MTBF
| Metric | Stands for | What it measures |
|---|---|---|
| MTBF | Mean Time Between Failures | Average uptime between incidents |
| MTTR | Mean Time To Recovery | Average time to fix after failure |
Availability ≈ MTBF / (MTBF + MTTR)
To improve availability: Increase MTBF (fewer failures) OR decrease MTTR (faster recovery). MTTR is often more actionable.
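Worked example: MTBF = 720 hours (roughly one failure per month) and MTTR = 0.72 hours gives 720 / 720.72 ≈ 99.9% availability; halving MTTR to 0.36 hours lifts it to ≈ 99.95% without preventing a single additional failure.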
Failure Modes: What Can Go Wrong
Hardware failures
- Server dies, disk fails, NIC fails
- Solution: Redundant hardware, RAID, multiple availability zones
Software failures
- Memory leaks, deadlocks, infinite loops, crashes
- Solution: Circuit breakers, watchdogs, automatic restarts, chaos engineering
Network failures
- Packet loss, latency spikes, partition (two parts of the network can’t reach each other)
- Solution: Retries with backoff, timeouts, circuit breakers
Human errors
- Misconfiguration, bad deployments, accidental deletes
- Solution: Deployment safeguards, rollback mechanisms, access controls, backup policies
Cascading failures
- One service slows → queues back up → retries increase → more load → adjacent services slow → entire system collapses
- Solution: Circuit breakers, bulkheads, load shedding
Redundancy: The Foundation
Redundancy = having more than one of something critical.
Active-Passive (Primary-Standby)
[Load Balancer] → [Active Server] ← does all the work
[Passive Server] ← standby, watching, ready
If Active fails:
[Load Balancer] → [Passive Server] ← takes over (failover)
Failover time: Seconds to minutes (passive must detect failure and warm up)
Resource efficiency: Passive is “wasted” — it’s idle until needed
Use for: Databases (primary-replica), systems where split-brain is dangerous
Active-Active
       ┌→ [Active Server 1]  ← handles 50% of traffic
[LB] ──┤
       └→ [Active Server 2]  ← handles 50% of traffic
If Server 1 fails:
[LB] → [Active Server 2] ← handles 100% of traffic (degraded, not down)
Failover time: Near zero (the LB reroutes as soon as health checks mark the failed server down)
Resource efficiency: Both servers doing work
Use for: Stateless app servers, read replicas, multi-region setups
Geographic redundancy
Region A (Mumbai)                          Region B (Singapore)
[LB] → [Servers] → [DB]  ←─(replicated)─→  [LB] → [Servers] → [DB]
Single datacenter disaster (fire, flood, power grid failure) doesn’t take down your system.
AWS terminology:
- Availability Zones (AZs) = isolated datacenters within a region.
- Multi-AZ = redundancy within a region.
- Multi-Region = redundancy across geographies.
Circuit Breaker Pattern
Story: Your home has a circuit breaker. When too much current flows (overload), it trips and cuts power rather than letting wires overheat and start a fire. Once the problem is fixed, you reset the breaker.
In software: When a downstream service is failing, stop calling it. Return an immediate error or fallback. Give it time to recover.
States
          failure threshold hit
CLOSED ──────────────────────────→ OPEN
(normal)                           (stop calling)
   ↑                                  │
   │ success                          │ wait for timeout
   │                                  ↓
   └───────────────────────────── HALF-OPEN
                                  (try one request)
         success → CLOSED
         failure → OPEN
CLOSED: Normal operation. Requests pass through. Count failures.
OPEN: Failure threshold exceeded. Immediately return error/fallback. No requests to downstream.
HALF-OPEN: After timeout, allow one test request. Success → CLOSED. Failure → OPEN.
Implementation pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.state = "CLOSED"
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.last_failure_time = None
        self.timeout = timeout

    def call(self, func, *args):
        if self.state == "OPEN":
            # Has the cool-down period elapsed? If so, allow one test request.
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF-OPEN"
            else:
                raise Exception("Circuit open — fast fail")
        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        # Any success (including the HALF-OPEN test request) closes the circuit.
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            # Also trips again on a failed HALF-OPEN test request, since the
            # failure count is still at or above the threshold.
            self.state = "OPEN"
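A brief usage sketch for the class above; the URL, timeout values, and the empty-list fallback are illustrative placeholders, not part of any particular API.

# Usage sketch: wrap a downstream HTTP call with the breaker above.
import requests

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def fetch_recommendations():
    # Hypothetical endpoint; pair the breaker with an explicit timeout.
    response = requests.get("https://recs.internal/api/v1/top", timeout=(3, 10))
    response.raise_for_status()
    return response.json()

try:
    recommendations = breaker.call(fetch_recommendations)
except Exception:
    recommendations = []  # fallback: degrade gracefully instead of erroring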
Tools: Netflix Hystrix (Java), Resilience4j, Polly (.NET), Python pybreaker
Retry with Exponential Backoff
Story: You call a restaurant. Line busy. You call again immediately — still busy. And again. You’re making it worse by adding load. Smart approach: wait a bit, try again. Wait longer, try again. Eventually stop trying.
The pattern
import time
import random

class TransientError(Exception):
    """Placeholder for whatever exception your client raises on transient failures."""
    pass

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # final attempt — give up
            # Exponential backoff with jitter
            delay = (2 ** attempt) * base_delay + random.uniform(0, 1)
            time.sleep(delay)

# Attempt 0 fails: sleep ~1s
# Attempt 1 fails: sleep ~2s
# Attempt 2 fails: sleep ~4s
# Attempt 3 fails: sleep ~8s
# Attempt 4 fails: re-raise (no more retries)
Jitter is critical: Without jitter, all clients that failed at the same time retry at the same time — thundering herd. Random jitter spreads the retries out.
What to retry vs what not to
| Error type | Retry? | Why |
|---|---|---|
| Network timeout | Yes | Transient — service may be temporarily slow |
| 503 Service Unavailable | Yes | Service temporarily overloaded |
| 429 Too Many Requests | Yes, after backoff | Rate limit — back off and retry |
| 400 Bad Request | No | Your request is malformed — retrying won’t help |
| 401 Unauthorized | No | Auth issue — retrying won’t help |
| 404 Not Found | No | Resource doesn’t exist |
| 500 Internal Server Error | Sometimes | Depends on idempotency |
Idempotency matters: Only retry if the operation is idempotent (same result if called multiple times). GET is idempotent. POST that creates a resource is NOT (retrying creates duplicates).
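One way to encode that rule is to gate retries on the HTTP method; this is a sketch of that convention (status codes follow the table above), not a complete retry policy.

# Sketch: only retry operations that are idempotent by HTTP semantics.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}
RETRYABLE_STATUSES = {429, 503}  # transient, per the table above

def should_retry(method, status_code):
    return status_code in RETRYABLE_STATUSES and method in IDEMPOTENT_METHODS

print(should_retry("GET", 503))   # True: safe to retry with backoff
print(should_retry("POST", 503))  # False: a retry could create a duplicate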
Bulkhead Pattern
Story: A ship’s hull is divided into watertight compartments (bulkheads). If one section floods, it doesn’t sink the whole ship.
In software: Isolate failures in one component so they don’t cascade to others.
Thread pool isolation
Without bulkheads:
[App] has a single shared pool of 100 threads
Service A gets slow → its calls consume 90 threads, all stuck waiting
Calls to Service B start queuing for free threads → B appears slow or down even though it is healthy
With bulkheads:
Service A → dedicated pool of 30 threads
Service B → dedicated pool of 30 threads
Service C → dedicated pool of 30 threads
Service A degrades → only its 30 threads are affected
Services B and C continue normally
Resource bulkheads
Separate resource pools for different criticality levels:
Critical path (checkout, payment): large thread pool, high priority
Non-critical (recommendations, analytics): small pool, low priority
Payment service slow? Recommendations are unaffected.
Timeouts: The Forgotten Hero
The danger of no timeouts: one slow downstream service causes your threads to hang indefinitely, exhausting your thread pool and bringing down your entire service.
# Always set timeouts on external calls
import requests
# Tuple: (connection_timeout, read_timeout)
response = requests.get(
"https://api.payment.com/charge",
timeout=(3, 10) # 3s to connect, 10s to read
)
Timeout hierarchy
User's browser timeout (30s)
└── API Gateway timeout (25s)
└── Service A timeout (20s)
└── Service B timeout (15s)
└── DB query timeout (10s)
Rule: Each layer’s timeout < the caller’s timeout. This ensures the inner layer fails first, giving the outer layer time to handle it gracefully.
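One way to keep the hierarchy consistent is to pass an absolute deadline down the call chain and derive each timeout from whatever budget remains; this is a sketch of that idea, not any specific framework’s API.

# Sketch: derive a downstream timeout from the caller's remaining budget.
import time

def call_with_budget(func, deadline, safety_margin=0.5):
    # deadline is an absolute time.time() value set by the outermost caller;
    # the margin reserves time for this layer to handle a failure gracefully.
    remaining = deadline - time.time() - safety_margin
    if remaining <= 0:
        raise TimeoutError("No time budget left, fail fast")
    return func(timeout=remaining)  # assumes func accepts a timeout argument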
Rate Limiting: Protecting Your System
Rate limiting prevents any single client from overwhelming your system.
Token Bucket Algorithm
Bucket has N tokens.
Each request consumes 1 token.
Tokens refill at rate R/second.
Request arrives → tokens available? → yes: process, consume token → no: reject (429)
Allows bursts up to bucket size. Good for: APIs with burst tolerance.
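A minimal in-process sketch of a token bucket; a production limiter would usually keep this state somewhere shared (for example Redis) and handle concurrency.

# Token bucket sketch: refill lazily based on elapsed time.
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum tokens = allowed burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # process the request
        return False      # reject with 429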
Leaky Bucket Algorithm
Requests enter a queue (bucket). Queue drains at fixed rate.
Queue full → new requests dropped.
Smooths traffic to a constant rate. No bursts allowed. Good for: metered APIs.
Fixed Window Counter Algorithm
Window: 0–60s → max 100 requests
At second 59: user sends request #100 → OK
At second 61: counter resets → user sends 100 more immediately
Problem: user can send 200 requests in 2 seconds across the boundary
Sliding Window Log Algorithm
Track exact timestamps of all requests. Count requests in past 60 seconds.
Timestamps: [59.1, 59.3, 59.7, 59.9, 60.1, 60.4]
At 60.5: count in window [0.5–60.5] = 6 → OK (well under 100)
Accurate but memory-intensive (store every timestamp).
Sliding Window Counter Algorithm (best practical choice)
current_window_count + (previous_window_count × overlap_percent)
If: prev_count=84, cur_count=36, overlap=70%
Estimated = 36 + 84×0.7 = 94.8 → under 100 → allow
Good balance of accuracy and memory efficiency.
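The estimate can be expressed in a few lines; this sketch assumes the per-client counts for the previous and current windows are tracked elsewhere.

# Sliding window counter sketch: weight the previous window by its overlap.
def allow_request(prev_count, cur_count, elapsed_in_window, window_size, limit):
    overlap = 1 - (elapsed_in_window / window_size)  # share of the previous window still in view
    estimated = cur_count + prev_count * overlap
    return estimated < limit

# 18s into a 60s window (70% overlap with the previous window):
print(allow_request(prev_count=84, cur_count=36,
                    elapsed_in_window=18, window_size=60, limit=100))  # True (≈ 94.8)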
Graceful Degradation
Story: Netflix is having a problem with its recommendation engine. Instead of returning an error to users, it shows “Popular movies in your region” — a simpler, always-available fallback. The page still loads, just with less personalization.
Core principle: When a non-critical component fails, degrade gracefully — show something useful rather than an error.
Strategies
| Failure | Bad response | Graceful degradation |
|---|---|---|
| Recommendation service down | Error 500 | Show “Popular in your region” |
| Search service slow | Timeout + error | Show cached results from 1 hour ago |
| Payment service down | "Error processing" | "We’re experiencing issues. Try again in a few minutes" |
| Image CDN failing | Broken images | Show placeholder icons |
| Analytics service down | Log loss | Drop analytics silently — user unaffected |
Feature flags
Turn features on/off without deployment:
def show_recommendations(user_id):
    # feature_flag, recommendation_service, and popular_items are placeholders
    # for your flag client, downstream client, and fallback source.
    if feature_flag("recommendations_enabled"):
        try:
            return recommendation_service.get(user_id)
        except Exception:
            return popular_items()  # fallback on failure
    return popular_items()  # feature disabled globally
Tools: LaunchDarkly, Flagsmith, AWS AppConfig
Health Checks and Monitoring
Your system can’t fix what it can’t detect.
Health check types
Shallow health check: Is the process running?
GET /health → 200 OK (just proves server is alive)
Deep health check: Are all dependencies working?
GET /health/deep →
{
"status": "degraded",
"checks": {
"database": "ok",
"redis": "ok",
"payment_service": "failing",
"search": "ok"
}
}
Important: Don’t route real traffic based on deep health checks — a failing non-critical dependency shouldn’t remove a server from the load balancer pool.
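A minimal sketch of both endpoints, here using Flask; the check_* probes are illustrative placeholders for real dependency checks.

# Health check sketch with Flask; the probe functions are placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    return "ok"        # placeholder: e.g. run a cheap query against the primary

def check_payment_service():
    return "failing"   # placeholder: e.g. call the dependency's own /health

@app.route("/health")
def shallow_health():
    # Shallow: only proves the process is up and serving HTTP.
    return jsonify({"status": "ok"}), 200

@app.route("/health/deep")
def deep_health():
    checks = {"database": check_database(),
              "payment_service": check_payment_service()}
    status = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    # Report "degraded" but still return 200 so the load balancer does not
    # eject the instance over a non-critical dependency (see the note above).
    return jsonify({"status": status, "checks": checks}), 200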
The four golden signals (Google SRE)
| Signal | What to monitor | Alert when |
|---|---|---|
| Latency | Request duration (p50, p99, p999) | p99 > SLA threshold |
| Traffic | Requests per second | Anomalous spike or drop |
| Errors | Error rate (5xx, 4xx) | Error rate > 1% |
| Saturation | CPU, memory, disk, queue depth | > 80% utilization |
Flashcards
Q: What is the circuit breaker pattern?
A resilience pattern that stops calling a failing service (OPEN state) to prevent cascading failures. After a timeout, it tests with one request (HALF-OPEN) before returning to normal (CLOSED).
Q: What is exponential backoff with jitter?
Retry with increasing delays (2^attempt × base) plus random jitter to prevent all clients retrying simultaneously.
Q: What is the bulkhead pattern?
Isolating components into separate resource pools so failure in one doesn’t exhaust resources for others.
Q: What are the four golden signals?
Latency, Traffic, Errors, Saturation.
Q: What is the difference between active-passive and active-active redundancy?
Active-passive: one handles traffic, standby waits (failover takes time). Active-active: both handle traffic simultaneously (zero failover time).
Q: What is graceful degradation?
When a non-critical component fails, the system continues with reduced functionality rather than returning an error.
Q: How do you design for failure?
Framework answer:
- Redundancy: Every critical component is replicated. No single points of failure.
- Circuit breakers: Downstream failures don’t cascade. Fast fail with fallback.
- Retry with backoff and jitter: Transient failures are automatically retried without thundering herd.
- Timeouts everywhere: No hanging threads. Each call has a deadline.
- Bulkheads: Critical and non-critical paths are isolated.
- Graceful degradation: Non-critical features fail silently; core functionality continues.
- Health checks and monitoring: Detect failures fast. MTTR depends on detection time.
Q: What is a SPOF (Single Point of Failure) checklist?
For every component in your design: “What happens if this dies?”
- Single app server → add LB + multiple servers
- Single DB → primary-replica + automatic failover
- Single cache node → Redis Cluster or Sentinel
- Single load balancer → active-passive LB pair
- Single datacenter → multi-AZ or multi-region