Alok Ranjan Daftuar

Posted on • Originally published at aloknecessary.github.io

Designing for Partial Failure: Why 'Everything is Highly Available' Is a Myth

Your system will fail. The question is whether it fails completely or gracefully — and that answer is decided at design time, not incident time.

High availability is not a property of your system — it is an emergent behavior of how your system handles the inevitable failure of its parts. A cluster of five-nines components can still produce a zero-nines system if you haven't designed for what happens when one of them degrades.


Partial Failure Is the Normal State

The CAP theorem is clean in theory. In production, it manifests as partial unavailability — some nodes respond, some don't, and your system has to decide what to do about it.

Real-world partitions are never clean:

  • A replica is reachable but 800ms slower than normal
  • A downstream service responds to health checks but times out on actual requests
  • A database primary is alive, but replication lag has grown to 45 seconds

Degraded systems are harder to reason about than failed ones. A service that returns errors is visible. A service that returns stale data silently is far more dangerous.

Practical rule: for every external dependency, explicitly answer — "What does my service do when this dependency is unavailable for 30 seconds? For 5 minutes? For 30 minutes?"
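One way to make those answers hard to skip is to write them down next to the code as a typed policy per dependency and outage window, so reviewers and on-call engineers see the intended behavior rather than guessing it during an incident. A minimal sketch, with hypothetical dependency names and behaviors:

```typescript
// Hypothetical degradation policy: what this service does per dependency
// as an outage stretches from seconds to minutes.
type OutageWindow = "30s" | "5m" | "30m";

interface DegradationPolicy {
  [dependency: string]: Record<OutageWindow, string>;
}

export const degradationPolicy: DegradationPolicy = {
  pricingService: {
    "30s": "serve last-cached price, flag 'price may vary'",
    "5m": "serve cached price, disable price-match promotions",
    "30m": "hide dynamic pricing, show list price only",
  },
  paymentService: {
    "30s": "retry with backoff, keep order in 'pending' state",
    "5m": "queue order for async capture, notify the user of the delay",
    "30m": "disable checkout, show maintenance banner",
  },
};
```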


How Cascading Failures Actually Propagate

Cascading failures rarely start big. They start with one slow service and end with everything down.

The classic thread pool exhaustion cascade:

  1. A payment service starts responding in 4s instead of 200ms
  2. Threads pile up — connection pool goes from 20% to 80%
  3. Latency bleeds upstream — unrelated operations slow down because they share the same thread pool
  4. Connection pools hit their limit — new requests fail immediately
  5. Health checks fail — load balancer removes the service from rotation

Your entire checkout is down because a payment service was slow — not even failed.

Cascade enablers to eliminate:

  • Synchronous chains longer than 2–3 hops
  • Shared thread pools across dependencies
  • Missing or oversized timeouts (a 30s default in a p99 200ms service is a loaded gun)
  • Retry storms without backoff
  • No bulkhead isolation between operations

Graceful Degradation Patterns

Circuit Breaker — monitors failure rate and opens the circuit above a threshold, immediately returning a fallback instead of attempting the call. States: Closed → Open → Half-Open.
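A minimal sketch of the pattern using opossum (one of the libraries covered in the full article); the downstream call, thresholds, and fallback payload here are illustrative assumptions, not production values:

```typescript
import CircuitBreaker from "opossum";

// Hypothetical downstream call; replace with your real payment client.
async function fetchPayment(orderId: string): Promise<{ status: string }> {
  const res = await fetch(`https://payments.internal/orders/${orderId}`);
  if (!res.ok) throw new Error(`payment service returned ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchPayment, {
  timeout: 2000,                // calls slower than 2s count as failures
  errorThresholdPercentage: 50, // open the circuit above 50% failures
  resetTimeout: 30000,          // after 30s, move to Half-Open and probe
});

// Fallback returned immediately while the circuit is open.
breaker.fallback(() => ({ status: "pending-verification" }));

// State transitions are exactly the signals worth alerting on.
breaker.on("open", () => console.warn("payment circuit OPEN"));
breaker.on("halfOpen", () => console.info("payment circuit HALF-OPEN"));
breaker.on("close", () => console.info("payment circuit CLOSED"));

// Usage: breaker.fire("order-123").then(handleResult).catch(handleError)
```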

Bulkhead Isolation — prevents one failing dependency from consuming all shared resources. Isolate thread/connection pools per downstream service. At the Kubernetes level, namespace-scoped ResourceQuotas provide the same isolation between workloads sharing a cluster.
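In application code, a bulkhead can be as small as a bounded concurrency limit per downstream service, so a slow dependency can only ever hold a fixed number of in-flight calls instead of draining a shared pool. A hand-rolled, library-free sketch:

```typescript
// A tiny semaphore: at most `limit` concurrent calls per dependency.
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // No free slot: wait until a finishing call hands its slot over.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next();    // hand the slot directly to the next waiter
      else this.active--;  // or free it
    }
  }
}

// One bulkhead per dependency: a slow payment service can saturate only
// its own slots, never the pricing service's.
const paymentBulkhead = new Bulkhead(10);
const pricingBulkhead = new Bulkhead(20);

// Usage: paymentBulkhead.run(() => fetchPayment("order-123"))
```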

Timeout Hierarchy — every downstream call's timeout must be shorter than the upstream caller's timeout. A 5s payment timeout inside a 20s checkout timeout inside a 30s user request timeout.
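A sketch of that hierarchy in TypeScript; the budgets and the service URL are illustrative, and AbortSignal.timeout assumes Node 18+ or a modern browser:

```typescript
// Each layer's budget is strictly smaller than its caller's budget,
// so inner calls fail (and can fall back) before the outer request gives up.
const TIMEOUTS_MS = {
  userRequest: 30_000, // edge / load balancer budget
  checkout: 20_000,    // checkout orchestration budget
  payment: 5_000,      // single downstream payment call
} as const;

async function callPayment(orderId: string): Promise<unknown> {
  const res = await fetch(`https://payments.internal/orders/${orderId}`, {
    signal: AbortSignal.timeout(TIMEOUTS_MS.payment),
  });
  if (!res.ok) throw new Error(`payment returned ${res.status}`);
  return res.json();
}

// The orchestration layer wraps its whole fan-out in the larger budget.
async function checkout(orderId: string): Promise<unknown> {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(
      () => reject(new Error("checkout budget exhausted")),
      TIMEOUTS_MS.checkout
    )
  );
  return Promise.race([callPayment(orderId), deadline]);
}
```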

Fallback Responses — return degraded but functional responses rather than errors. Pricing service down? Return last-cached price with a "price may vary" indicator. Feature flags unavailable? Return safe defaults.
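A minimal sketch of the stale-cache fallback, written against a generic cache interface rather than a specific Redis client (the full article uses Redis); the types and names here are assumptions:

```typescript
interface PriceCache {
  get(sku: string): Promise<{ price: number; cachedAt: Date } | null>;
  set(sku: string, price: number): Promise<void>;
}

interface PriceResponse {
  price: number;
  degraded: boolean; // drives the "price may vary" indicator in the UI
}

async function getPrice(
  sku: string,
  pricingClient: { fetchPrice(sku: string): Promise<number> },
  cache: PriceCache
): Promise<PriceResponse> {
  try {
    const price = await pricingClient.fetchPrice(sku);
    await cache.set(sku, price);         // refresh the fallback copy
    return { price, degraded: false };
  } catch (err) {
    const stale = await cache.get(sku);  // pricing is down: try stale data
    if (stale) return { price: stale.price, degraded: true };
    throw err;                           // nothing cached: surface the error
  }
}
```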

Retry with Exponential Backoff + Jitter — retries without backoff amplify load on degraded services. Jitter prevents synchronized retry storms. Only retry on transient failures (5xx, timeouts) — never on 4xx.
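A compact sketch of that retry policy; the attempt count, base delay, cap, and the transient-error classifier are illustrative defaults:

```typescript
// Retry only transient failures, with exponential backoff and full jitter.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  { attempts = 4, baseMs = 100, capMs = 2_000 } = {}
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (!isTransient(err) || attempt === attempts - 1) throw err;
      // Full jitter: random delay in [0, min(cap, base * 2^attempt)).
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // unreachable; keeps the compiler satisfied
}

// Hypothetical classifier: retry 5xx and timeouts/network errors, never 4xx.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number }).status;
  return status === undefined || status >= 500;
}
```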


Observability During Partial Failure

Graceful degradation can mask serious problems for hours. The four signals that matter most:

  • Circuit breaker state transitions — a circuit that opens at 2am with no alert is silent failure accumulating debt
  • Per-dependency error rates — aggregate 0.5% looks fine; per-dependency 40% on payment tells you exactly what's wrong
  • Queue depth and consumer lag — the leading indicator before errors surface at the API layer
  • Fallback invocation rate — 0.1% is noise; 15% is a dependency in chronic distress being silently masked
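The full article covers structured logging for degraded responses. The core idea, sketched below with hypothetical field names, is to emit one machine-parseable event every time a fallback is served or a circuit is open, so the fallback invocation rate and per-dependency error rates fall straight out of your logs:

```typescript
// Hypothetical structured event emitted whenever a degraded path is taken.
interface DegradedResponseEvent {
  event: "degraded_response";
  dependency: string;          // e.g. "pricing-service"
  reason: "circuit_open" | "timeout" | "error" | "stale_cache";
  fallback: string;            // which fallback was served
  latencyMs: number;
  timestamp: string;
}

function logDegraded(
  e: Omit<DegradedResponseEvent, "event" | "timestamp">
): void {
  const record: DegradedResponseEvent = {
    event: "degraded_response",
    timestamp: new Date().toISOString(),
    ...e,
  };
  // One JSON line per event: easy to count, group, and alert on.
  console.log(JSON.stringify(record));
}

// Usage: logDegraded({ dependency: "pricing-service", reason: "stale_cache",
//                      fallback: "last-cached-price", latencyMs: 12 });
```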

The Key Insight

Most architecture reviews start with the happy path. Flip that. Design your degraded states first — what does this system look like when the payment service is down? When a network partition isolates one AZ?

If you can answer with specific, tested, observable behaviors — you have a resilient system. If the answer is "it depends on what fails" — you have a system that will surprise you in production.

Partial failure is not an edge case. It is the normal operating condition of any distributed system at scale.


Read the Full Article

This is a summary of my deep dive into designing for partial failure. The full article covers each pattern with production implementation examples, real-world cascade case studies, and a complete resilience checklist:

👉 Designing for Partial Failure — Full Article

The full article includes:

  • CAP theorem applied to partial unavailability scenarios
  • Real-world cascade case studies (AWS us-east-1 2021, Facebook 2021)
  • Circuit breaker implementation in .NET (Polly) and Node.js (opossum)
  • Bulkhead isolation with Kubernetes ResourceQuotas
  • Timeout hierarchy design with named HttpClient factories
  • Stale cache fallback implementation with Redis
  • Retry with exponential backoff and jitter (TypeScript)
  • Structured logging patterns for degraded responses
  • Production resilience checklist
