u/llamacoded

switched from liteLLM to a go-based proxy, tradeoffs after a month

we were on litellm for about 6 months and it was mostly fine. the thing that eventually killed it for us was streaming latency. every request was getting maybe 5-8ms added, which doesn't sound bad until you stack tool calls in a multi-turn agent and the user is sitting there watching a spinner for an extra 200ms per turn. we spent two weeks trying to optimize it and i'm still not sure if it was litellm or our setup, but we couldn't get it lower. could totally be a skill issue on our end tbh
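rough math on why a few ms per hop hurts. the 7ms and 30 calls below are just our scenario (midpoint of what we measured, and a typical tool-call chain length for us), not universal numbers:

```python
# proxy overhead stacks linearly across sequential upstream calls in one agent turn.
# 7ms and 30 calls are illustrative numbers from our setup, not benchmarks.
overhead_ms_per_call = 7    # midpoint of the 5-8ms we saw per request
calls_per_turn = 30         # tool calls + follow-ups in a single agent turn
extra_latency_ms = overhead_ms_per_call * calls_per_turn
print(extra_latency_ms)     # 210 -> the extra spinner time per turn
```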

switched to bifrost which is a go proxy. latency is better but the migration took a bit of effort. we had a few provider configs that didn’t transfer cleanly and one of our test providers isn’t supported yet so we paused that integration. not a blocker for us but worth calling out

the one thing that actually surprised me was the cost logging. we could see per-request costs tagged by endpoint, and that's how we found out our summarization step was doing 5 retries on failures and each retry was resending the full context. it was costing us roughly 3x what we thought for that step. litellm gives you cost data but it's per-provider, not per-request, so we never would have caught that
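for anyone curious what the retry blowup looks like, here's a toy sketch. the token count and price are made up, the point is that each retry resends the whole accumulated context, so cost scales with (1 + retries):

```python
# input cost of a step that retries on failure, resending full context each time.
# token count and price are invented for illustration.

def step_cost(context_tokens, retries, price_per_mtok):
    """total input cost: the original call plus `retries` full resends."""
    calls = 1 + retries
    return calls * context_tokens / 1e6 * price_per_mtok

expected = step_cost(50_000, 0, 2.0)  # what we budgeted: one call per document
actual = step_cost(50_000, 5, 2.0)    # what per-request logs actually showed
print(round(actual / expected, 1))    # 6.0x in this worst case
```

ours averaged out closer to 3x because only some requests hit the retry path.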

that said the docs are still catching up. i had to read go source code once or twice to figure out some config options. filed issues and got responses pretty fast though so that helped

not saying everyone should switch. litellm has way more providers and if you're a python shop extending it is easy. we just had a specific latency problem and this solved it for us

u/llamacoded — 2 days ago
▲ 4 · r/AIcosts · +1 crossposts

LLM API prices dropped ~80% since 2024. How are you updating your production cost models?

Input token costs across major providers have dropped roughly 80% in two years. Gemini Flash-Lite is now $0.25 per million input tokens. Models that cost $15/MTok in early 2024 are at $1-3 now.

The obvious effect is that things that were cost-prohibitive in 2024 are cheap today. But the less obvious effect is that your cost models from 18 months ago are wrong in ways that affect architectural decisions.

Specifically: a lot of teams built caching layers, batching pipelines, and model tiering strategies because frontier model costs forced it. Those decisions made sense at $15/MTok. At $1/MTok some of them are still worth it, and some of them are complexity you're maintaining for a cost that no longer exists.

We had a semantic caching layer that was saving us about $340/month at 2024 prices. At current prices that's $47/month. The engineering time to maintain it is worth more than $47/month. Ripped it out.
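The re-pricing exercise is easy to script. The token volume below is back-solved from our own numbers and the prices are approximate, but the structure applies to any cache:

```python
# re-price a semantic cache's savings under new token costs.
# all figures are illustrative, back-solved from the $340 -> $47 example above.

def monthly_cache_savings(tokens_saved_mtok: float, price_per_mtok: float) -> float:
    """Dollars/month saved by not re-sending tokens_saved_mtok million input tokens."""
    return tokens_saved_mtok * price_per_mtok

TOKENS_SAVED_MTOK = 22.7   # ~22.7M input tokens/month served from cache (assumed)
PRICE_2024 = 15.0          # $/MTok, early-2024 frontier pricing
PRICE_NOW = 2.07           # $/MTok, roughly 86% lower

savings_2024 = monthly_cache_savings(TOKENS_SAVED_MTOK, PRICE_2024)  # ~$340/mo
savings_now = monthly_cache_savings(TOKENS_SAVED_MTOK, PRICE_NOW)    # ~$47/mo
```

If `savings_now` is below what the cache costs you in engineering time, the cache is now pure liability.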

Conversely, we had batch jobs that used cheap models because the frontier was too expensive for non-critical tasks. That price signal is gone. Now we're running frontier models on more tasks and getting better output quality at the same cost.

The broader point: cost-driven architectural decisions have a shelf life. Worth auditing which ones were built for a price environment that no longer exists.

u/llamacoded — 2 days ago

Switching to DeepSeek R2 broke our evals (not for the reason we expected)

We tried swapping to DeepSeek R2 after the pricing drop. Expected some quality differences. That’s not what broke.

Our evals were calibrated on Claude Sonnet outputs. Not as ground truth, just as a consistent baseline. We use a model-as-judge setup, and all our pass/fail thresholds were tuned to Sonnet’s scoring distribution.

R2 doesn’t score the same way.

On some reasoning tasks it’s more lenient, on others stricter. Our “~80% pass rate = ship” threshold instantly became meaningless. At first it looked like a regression, but it was just a calibration shift.

What worked for us:

  • run both models in parallel on the same eval set
  • compare score distributions instead of raw pass rates
  • remap thresholds before making any decision

Only after that did the comparison make sense.
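Concretely, "remap thresholds" for us meant percentile matching: hold the pass rate fixed and solve for the cutoff under the new judge's score distribution. A minimal sketch (the function is ours, not from any eval library):

```python
import numpy as np

def remap_threshold(scores_old, scores_new, old_threshold):
    """Find the new-judge cutoff that passes the same fraction of items
    as old_threshold did under the old judge. Both score lists must come
    from the SAME eval set."""
    pass_rate = np.mean(np.asarray(scores_old) >= old_threshold)
    # the new cutoff sits at the (1 - pass_rate) quantile of the new distribution
    return float(np.quantile(scores_new, 1.0 - pass_rate))
```

E.g. if the old judge passed 40% of items at a cutoff of 4, this returns whatever new-judge score also passes 40% of the same items.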

If you’re testing new models and your evals depend on a judge model, don’t assume scores are interchangeable. The baseline matters more than the model you’re swapping in.

We ended up running both models in shadow for a bit to figure this out without breaking anything.

u/llamacoded — 3 days ago

GLM-5.1 allegedly beat Claude Opus 4.6 and GPT-5.4 on SWE-Bench Pro. Why I'm skeptical.

GLM-5.1 released last week — 744B parameters, MIT license, 40B active per forward pass, 200K context. The headline is it beat both Claude Opus 4.6 and GPT-5.4 on SWE-Bench Pro. That's a significant claim.

My issue with SWE-Bench Pro: the eval methodology matters enormously. The difference between "model solved the GitHub issue" and "model produced output that passed the test suite" is substantial. Test suites for open-source repos have gaps. A model that learned to produce plausible-looking diffs that pass existing tests isn't the same as a model that actually understood the bug.

Also, 744B MoE with 40B active is not comparable to a 100B dense model in deployment cost. The "40B active parameters" framing undersells the routing overhead, KV cache size at 200K context, and cold-start behavior on sparse expert activations. The inference math is not simple.
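To make the KV cache point concrete, here's the standard back-of-envelope. Every architecture number in it is a placeholder; I'm not claiming these are GLM-5.1's real layer/head counts:

```python
# back-of-envelope KV cache size for one sequence at long context.
# formula: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem.
# the config values are hypothetical, not GLM-5.1's published architecture.

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# e.g. 90 layers, 8 KV heads (GQA), head_dim 128, fp16, full 200K context
print(round(kv_cache_gib(90, 8, 128, 200_000), 1))  # ~68.7 GiB per sequence
```

Even with aggressive GQA, long-context serving is dominated by this term, which is why "40B active" understates the deployment footprint.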

None of this means GLM-5.1 is bad; early numbers from people running it locally look genuinely strong on a range of tasks. But benchmark comparisons between architecturally different models on a single eval set are weak evidence. I want to see it on real production task distributions, not curated GitHub issues from a fixed test set.

The MIT license is the actually important part. That changes the deployment math for enterprises with data residency requirements in a way the benchmark numbers don't.

u/llamacoded — 3 days ago



Claude Mythos is behind a 50-company firewall and it tells you everything about Anthropic's strategy

Anthropic has their best model, Claude Mythos, locked behind Project Glasswing. 50 partner organizations have access. Everyone else gets Opus 4.6 and Sonnet.

This isn't new in tech. Apple did it with early iPhone SDK access. OpenAI did it with GPT-4 before public launch. But the implications for teams building on Claude are worth thinking about.

If your competitor is one of those 50 companies, they're building with a model that's reportedly a step change above what you have access to. Your prompts, your evals, your product decisions are all calibrated against Opus 4.6. When Mythos goes public, your entire baseline shifts. The prompt strategies you optimized over months might underperform on a model with different strengths.

The more practical concern is dependency planning. If you're deep into the Anthropic ecosystem, your roadmap is partially gated by their release schedule. You can't plan around a model you can't test. And when it does drop, you're in a rush to migrate while your competitors who had early access are already shipping features on it.

This is the part of single-provider dependency that doesn't show up in the architecture diagrams. It's not just about uptime and failover. It's about access to capability being unevenly distributed by business relationships.

u/llamacoded — 7 days ago
