u/clairenguyen_ops

Switching LLM providers mid-deployment doubled our monthly bill before we caught it

We switched from Claude 3.5 to a newer model mid-sprint to take advantage of better throughput on our summarisation workload. The bill jumped from $3,840 to $7,200 over 11 days before anyone noticed.

The obvious suspect was token count differences, and yeah, that was part of it. But the actual driver was that our retry logic was tuned for the old model's p99 latency profile. The new model was faster on average but had worse tail latency under load, so retries were firing at a timeout threshold that made sense before and was completely wrong now.

We'd never looked at retry-attributed spend as its own line item, just total token cost. Turns out retry storms can quietly double your bill while all your success metrics look fine.
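For anyone who wants to do this accounting, here's a minimal sketch of the idea: bucket spend by first attempt vs. retry so retry storms show up as their own line item. Everything here is hypothetical (the flat per-call price, the function name, the simulated latencies); real spend is per-token and model-specific, and a real client would abandon the request on timeout rather than just observing the latency. The point is only the bookkeeping.

```python
from collections import defaultdict

# Hypothetical flat per-call price for illustration; real billing is per-token.
COST_PER_CALL_USD = 0.012

def call_with_retries(latencies_s, timeout_s, max_attempts=3):
    """Simulate a retrying client and bucket spend by attempt type.

    `latencies_s` stands in for observed request latencies, one per attempt.
    A call slower than `timeout_s` is treated as abandoned client-side but
    still billed, then retried. Returns spend per bucket, so retry-attributed
    cost is visible on its own instead of folded into total token spend.
    """
    spend = defaultdict(float)
    for attempt, latency in enumerate(latencies_s[:max_attempts]):
        bucket = "first_attempt" if attempt == 0 else "retry"
        spend[bucket] += COST_PER_CALL_USD
        if latency <= timeout_s:
            break  # success within the timeout: no further retries
    return dict(spend)

# A timeout tuned for the old model's tail (say ~2s) against a model with
# worse tail latency: two slow calls get billed before one lands in time,
# so retries cost twice what the "real" request did.
print(call_with_retries([2.8, 2.9, 1.1], timeout_s=2.0))
```

If the retry bucket starts rivaling the first-attempt bucket, the timeout no longer matches the model's actual tail latency, which is exactly the failure mode we hit.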
