Designing for Resilience

Systems will fail. The question is whether failures cascade into outages or are contained and recovered automatically.

Failure Modes

Design for:

  • Network partitions
  • Service unavailability
  • Data corruption
  • Capacity exhaustion
  • Configuration errors

Patterns for Resilience

Circuit Breaker: Stop calling a failing service once errors pass a threshold, then probe it again after a cooldown

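import requests

# circuit_breaker is assumed to be defined elsewhere (it is not part of requests).
@circuit_breaker(failure_threshold=5, recovery_timeout=30)
def call_service(url):
    return requests.get(url, timeout=5)  # without a timeout, a hung call can block forever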
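
The circuit_breaker decorator above is not defined in this post. As a rough illustration of what such a decorator might do, here is a minimal sketch. It assumes "failure" means any raised exception, counts consecutive failures, and is not thread-safe. The CircuitOpenError name and the clock parameter are hypothetical and exist only to make the sketch testable.

```python
import functools
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


def circuit_breaker(failure_threshold=5, recovery_timeout=30, clock=time.monotonic):
    """Hypothetical decorator: open after N consecutive failures, probe after a cooldown."""

    def decorator(func):
        # Per-function state: consecutive failure count and when the circuit opened.
        state = {"failures": 0, "opened_at": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["opened_at"] is not None:
                if clock() - state["opened_at"] < recovery_timeout:
                    # Open: fail fast without touching the downstream service.
                    raise CircuitOpenError(f"{func.__name__}: circuit is open")
                # Cooldown elapsed (half-open): let this one trial call through.
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["failures"] += 1
                half_open = state["opened_at"] is not None
                if half_open or state["failures"] >= failure_threshold:
                    state["opened_at"] = clock()  # open, or re-open after a failed trial
                raise
            # Success closes the circuit and resets the counter.
            state["failures"] = 0
            state["opened_at"] = None
            return result

        return wrapper

    return decorator
```

Callers would typically catch CircuitOpenError and fall back to a cached or default response rather than surfacing the error.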

Bulkhead: Give each dependency its own resource pool so one failure cannot exhaust shared capacity and cascade
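
One way to picture a bulkhead is a cap on concurrent calls per dependency. The sketch below assumes a threaded service and rejects excess calls immediately instead of queueing them. The Bulkhead and BulkheadFullError names and the limits are hypothetical, chosen for illustration.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency (illustrative, not a library API)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast rather than block: a slow dependency should not pile up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no free slots")
        try:
            yield
        finally:
            self._slots.release()


# Separate compartments: saturating one leaves the other untouched.
payments = Bulkhead("payments", max_concurrent=2)
search = Bulkhead("search", max_concurrent=2)
```

A call site would wrap the remote call in "with payments.slot():" and treat BulkheadFullError as a signal to shed load or degrade gracefully.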

Retry with Backoff: Handle transient failures by retrying with increasing delays

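from tenacity import retry, stop_after_attempt, wait_exponential

# Without a stop condition, tenacity keeps retrying forever, so cap the attempts.
@retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
def flaky_operation():
    return do_something()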
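
To make the backoff schedule concrete, here is a dependency-free sketch of the same idea. It is not how tenacity is implemented. It assumes exponential growth capped at max_delay, plus "full jitter" (a random fraction of the cap) so that many clients do not retry in lockstep. The function name and all parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call func(); on failure wait and retry, doubling the delay cap each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap doubles per attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())  # full jitter: scale the cap by rng(), in [0, 1) by default
```

Narrowing retry_on to errors that are actually transient (timeouts, connection resets) avoids retrying requests that can never succeed.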

Chaos Engineering

Intentionally inject failures:

  • Kill random instances
  • Add network latency
  • Fill disks
  • Exhaust memory

Find weaknesses before they find you.
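
Killing instances or filling disks needs infrastructure tooling, but latency and failed calls can be approximated inside the application. The sketch below is one such approximation: a hypothetical decorator that delays calls by a random amount and fails a fraction of them. All names and defaults are assumptions, and the enabled flag is there so it can be switched off outside experiments.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos experiment."""


def inject_faults(failure_rate=0.1, max_latency=0.5, enabled=True,
                  rng=None, sleep=time.sleep):
    """Hypothetical decorator: add random latency and fail a fraction of calls."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                # Simulate a slow network hop before the call.
                sleep(rng.uniform(0, max_latency))
                # Simulate an unavailable dependency for roughly failure_rate of calls.
                if rng.random() < failure_rate:
                    raise InjectedFault(f"{func.__name__}: injected failure")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a client call with this in a staging environment is a cheap way to check whether timeouts, retries, and circuit breakers behave as intended.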

Game Days

Practice incident response. Run through scenarios. Build muscle memory for high-stress situations.