u/Character-File-6003

Post-mortem: two Claude outages in 48 hours and what actually broke in our failover

April 6, 2:14am PST: Claude API goes down. 100% error rate across LLM-dependent services. Duration: ~2 hours.

April 7: happens again. Slightly shorter, same impact.

We’d been treating the Claude API like a database: reliable enough that we didn’t build proper failover around it. That assumption didn’t hold.

The failure wasn’t just “provider down.” It was our failover logic.

We had a circuit breaker, but it was tuned for short spikes. A multi-hour outage exceeded the retry window, so the circuit stayed open without triggering a clean fallback. On top of that, health checks were running every 60s, which was too slow once things started degrading.

What we changed:

- health checks every ~10–15s
- circuit opens after 3 consecutive failures
- 45s backoff before retry
- automatic reroute to a secondary provider when the circuit opens

We also moved routing out of the app layer so failover doesn’t depend on application logic anymore.
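Rough shape of the breaker logic, in case it helps anyone. This is a minimal sketch, not our production code; the thresholds match the config above, and `primary`/`secondary` stand in for whatever callables hit your two providers:

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; reroutes to the secondary
    until a backoff window elapses, then retries the primary once."""

    def __init__(self, failure_threshold=3, backoff_s=45):
        self.failure_threshold = failure_threshold
        self.backoff_s = backoff_s
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, secondary):
        # While open, send everything to the secondary until backoff expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.backoff_s:
                return secondary()
            # Half-open: allow one retry against the primary.
            self.opened_at = None
            self.consecutive_failures = 0
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return secondary()  # fail over on every error, open or not
        self.consecutive_failures = 0
        return result
```

The half-open retry after backoff is the part our original version was missing: without it, a multi-hour outage just left the circuit open with no clean path back to the primary.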

Second outage (April 7): similar duration, but this time the failover changes held. No pages, and p99 latency looked normal from the outside.

The obvious takeaway is redundancy.

The actual one: LLM APIs need the same SRE treatment as any other critical dependency. We just hadn’t applied it.

reddit.com
u/Character-File-6003 — 3 days ago

Your LLM API is an infra dependency. Most teams aren't treating it like one.

Two separate Claude outages in 48 hours this month (April 6 and April 7, roughly 2 hours each) were what finally made us apply proper reliability patterns to our LLM dependencies.

What surprised me: we had more robust failover for our payment processor than for our LLM provider, even though the LLM was in more user-facing critical paths. Somehow the AI API got a pass that Stripe never would have.

The SRE patterns aren't new. Health checks. Circuit breakers with thresholds appropriate for your failure modes. Graceful degradation when the primary is unavailable. Runbooks for what on-call does when the circuit opens. We had all of this for our database read replicas, for the CDN, for third-party payment APIs. We'd hand-waved it for the LLM because it "usually works."

The harder piece is that LLM failover isn't clean like a database hot standby. Different providers have different prompt sensitivities, different tool call response formats, different defaults on edge cases. A transparent failover to GPT-4o when Claude goes down is not actually transparent if 30% of your system prompts assume Claude-specific formatting behavior.

So now prompt compatibility is a deployment artifact. Every system prompt gets tested against the failover provider as part of CI. Doesn't catch everything, but caught 4 regressions in the first week alone.
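A toy version of that CI gate, assuming each provider is exposed as a callable and responses are expected to be JSON with an `answer` field (our real schema is bigger, and the function names here are made up):

```python
import json

def prompt_compat_failures(system_prompt, providers, probe="compat check"):
    """Send one probe request through every provider and collect any
    whose response breaks the shared output schema."""
    failures = {}
    for name, complete in providers.items():
        try:
            parsed = json.loads(complete(system_prompt, probe))
            if "answer" not in parsed:
                failures[name] = "missing 'answer' field"
        except Exception as exc:  # bad JSON, transport error, etc.
            failures[name] = repr(exc)
    return failures
```

CI fails the build if the dict is non-empty, which is exactly the case where a prompt "works" on the primary but would break silently mid-failover.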

Two incidents in 48 hours was the wake-up call. Shouldn't have taken that.

u/Character-File-6003 — 3 days ago

Your LLM cost monitoring is probably wrong because you're trusting the client's token count

Claude Code v2.1.100 is injecting ~20K invisible tokens per request. Your /context view says 50K, the actual API call is 70K. Anthropic hasn't commented. Users are hitting quota in 90 minutes on $200/month Max plans.

This is the latest example but the pattern is universal. Every client tool, framework, and SDK adds overhead that isn't visible to the user. System prompts, safety instructions, tool definitions, conversation formatting. The gap between what you think you're sending and what you're actually billed for is real and growing.
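If you want a cheap tripwire for this, compare what your client thinks it sent against what the provider billed. Sketch only; the 5% tolerance is an arbitrary default, not a recommendation:

```python
def token_gap_alert(client_estimate, billed_input_tokens, tolerance=0.05):
    """Return (should_alert, gap) when the provider's billed input tokens
    exceed the client's own count by more than the tolerance fraction."""
    gap = billed_input_tokens - client_estimate
    return gap / max(client_estimate, 1) > tolerance, gap
```

The 50K-vs-70K case above trips this immediately: a 20K gap is 40% over the client's count.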

We caught a similar discrepancy last month when our per-request cost dashboard showed numbers 25% higher than what our application was calculating. Turned out our LangChain wrapper was appending a 3K token system prompt to every call that wasn't accounted for in our cost model. We'd been under-reporting costs by $1,100/month for three months.

After that we moved all cost tracking to the proxy layer. Everything routes through a gateway ([this one](https://git.new/bifrost)) that extracts the usage object from the provider's response. That's the source of truth for billing. What the client says it sent is logged for debugging but never used for cost attribution.
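Roughly what the gateway-side accounting looks like, simplified: this assumes an Anthropic-style `usage` object with `input_tokens`/`output_tokens` in the response body and per-million-token prices, and `ledger` is just a stand-in for wherever you persist cost records:

```python
def record_cost(provider_response, price_in_per_mtok, price_out_per_mtok, ledger):
    """Attribute cost from the provider's billed token counts,
    never from the client's own estimate."""
    usage = provider_response["usage"]
    cost_usd = (usage["input_tokens"] * price_in_per_mtok
                + usage["output_tokens"] * price_out_per_mtok) / 1_000_000
    ledger.append({
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cost_usd": cost_usd,
    })
    return cost_usd
```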

If your cost monitoring is based on counting tokens in your application code, you're almost certainly under-reporting. The only reliable number is what the provider says it processed, and even that deserves an occasional spot check.

u/Character-File-6003 — 4 days ago

The hidden cost of LLM model deprecations is eval reproducibility, not config changes

Claude 3 Haiku is being retired this month. Third model deprecation in our stack this year. Saw a post about this on r/MachineLearning and all the comments were "just use config aliases, this is a skill issue." They're not wrong about the config part. But config is maybe 5% of the actual work.

Here's what actually happens when a model gets deprecated in a production ML system:

Your regression test suite breaks. Every golden example was generated by the old model. The new model produces different outputs that are probably fine but you need a human to review each one and confirm. For us that's about 200 test cases. Two days of an engineer's time just reviewing diffs.
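The review itself is mostly mechanical; the only tooling that helped was splitting exact matches from cases that actually need a human. Minimal sketch, assuming golden and new outputs are dicts keyed by case id (names here are made up):

```python
import difflib

def triage_golden_diffs(golden, new_outputs):
    """Return {case_id: unified diff} for every case where the new model's
    output differs from the golden example; identical cases drop out."""
    needs_review = {}
    for case_id, expected in golden.items():
        actual = new_outputs.get(case_id, "")
        if actual != expected:
            needs_review[case_id] = "\n".join(difflib.unified_diff(
                expected.splitlines(), actual.splitlines(),
                fromfile="old-model", tofile="new-model", lineterm=""))
    return needs_review
```

In practice most of our 200 cases matched byte-for-byte or nearly so; the human time goes into the residue this function leaves behind.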

Your experiment history becomes unverifiable. We have 4 months of A/B test results comparing prompt strategies. All evaluated against old Haiku. Those results cannot be reproduced anymore. The model is gone. If a new hire asks why we chose approach B, the honest answer is "it tested better on a model that no longer exists."

Your cost projections break. New Haiku has different pricing. Every cost model, budget alert, and capacity plan that referenced old Haiku's pricing needs to be updated. Our finance team had already locked in Q2 projections based on old pricing.

Your monitoring baselines shift. Latency percentiles, token counts, error rates are all different for the new model. Every alerting threshold needs to be recalibrated or you get a week of false alarms.

The config change takes 5 minutes. The downstream cleanup takes a week. And this happens 2-3 times a year per provider now. Nobody has good tooling for this because the MLOps ecosystem still treats model selection as a one-time decision rather than an ongoing operational concern.

u/Character-File-6003 — 9 days ago