Post-mortem: two Claude outages in 48 hours and what actually broke in our failover
April 6, 2:14am PST: Claude API goes down. 100% error rate across LLM-dependent services. Duration: ~2 hours.
April 7: happens again. Slightly shorter, same impact.
We’d been treating the Claude API like a database: reliable enough that we didn’t build proper failover around it. That assumption didn’t hold.
The failure wasn’t just “provider down.” It was our failover logic.
We had a circuit breaker, but it was tuned for short spikes. A multi-hour outage blew past the retry window it was sized for, and with no fallback path configured, an open circuit just meant failing fast indefinitely. On top of that, health checks ran every 60s, which was too slow to catch degradation once it started.
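Reconstructed as a sketch (class and exception names are ours, not the real code), the old breaker behaved roughly like this: once open, it only ever re-probes the same provider, so a long outage becomes an endless fail-fast loop:

```python
class ServiceUnavailable(Exception):
    pass


# Spike-tuned breaker: after the retry window it probes the SAME
# provider again. There is no fallback, so during a long outage every
# request either fails fast or fails on the probe -- 100% errors.
class SpikeBreaker:
    def __init__(self, retry_window_s=5.0):
        self.retry_window_s = retry_window_s
        self.opened_at = None  # None means the circuit is closed

    def call(self, provider, now):
        if self.opened_at is not None and now - self.opened_at < self.retry_window_s:
            raise ServiceUnavailable("circuit open, failing fast")
        try:
            result = provider(now)
            self.opened_at = None  # provider recovered: close the circuit
            return result
        except Exception:
            self.opened_at = now  # (re)open and wait out another window
            raise
```

A breaker like this is fine for a 30-second blip; against a 2-hour outage it protects the provider but does nothing for the caller.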
What we changed:
health checks every ~10–15s
circuit opens after 3 consecutive failures
45s backoff before retry
automatic reroute to a secondary provider when the circuit opens
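The new parameters, sketched as a minimal per-provider breaker (names and structure are illustrative, not our production code; the real version is also driven by the 10–15s health checks):

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; allows a
    retry probe once `backoff_s` seconds have elapsed."""

    def __init__(self, failure_threshold=3, backoff_s=45.0):
        self.failure_threshold = failure_threshold
        self.backoff_s = backoff_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Open: allow a probe once the backoff has elapsed.
        return now - self.opened_at >= self.backoff_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


def route(primary, secondary, now=None):
    """Send traffic to the secondary provider while the primary's
    circuit is open."""
    return "primary" if primary.allow(now) else "secondary"
```

When the primary's circuit opens, `route` shifts traffic to the secondary until a probe against the primary succeeds and `record_success` closes the circuit again.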
We also moved routing out of the app layer so failover doesn’t depend on application logic anymore.
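We won't name the edge component here, but as one illustration of what "routing outside the app" can look like: nginx upstreams can express the same 3-failure / 45s policy declaratively (the `primary-gateway` / `secondary-gateway` hosts below are hypothetical):

```nginx
upstream llm_providers {
    # Mark the primary unavailable after 3 failures within 45s.
    server primary-gateway:8443 max_fails=3 fail_timeout=45s;
    # Only receives traffic while the primary is marked down.
    server secondary-gateway:8443 backup;
}

server {
    listen 443;
    location /v1/ {
        proxy_pass http://llm_providers;
        # Retry the next upstream on connection errors and 5xx.
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

The point is less the specific proxy than that failover now happens even if the application layer is itself misbehaving.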
Second outage (April 7): similar duration. No pages went out, and p99 latency looked normal from the outside: requests that fail fast don't move the latency graphs.
The obvious takeaway is redundancy.
The actual one: LLM APIs need the same SRE treatment as any other critical dependency. We just hadn’t applied it.