Switched 70% of our agent traffic to DeepSeek R2 without a redeploy. Here's how

DeepSeek R2 came out last week, with pricing roughly 70% lower than the Western frontier models we were using. For a pre-seed startup that number matters.

The problem with switching models mid-production: we had LangChain agents with prompts tuned to a specific provider's behavior. Every previous model switch meant updating config, testing, redeploying, and praying nothing broke at 2am. With 3 people on the team that's a half-day minimum.

What we did instead: route through a gateway (we use an open-source one) with a weighted routing config. Set R2 to handle 30% of traffic initially, watch error rates and output quality for 48 hours, then bump to 70%. No code changes. No redeploys. If R2 started producing bad outputs we could roll back in 30 seconds by changing a config value.
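
Since a few people asked what the weighted config actually looks like: roughly this. It's an illustrative Python sketch of the idea, not our gateway's real schema, and the provider/model names are placeholders.

```python
# Illustrative only: the exact shape depends on the gateway you run. The point is
# that the traffic split lives in config the gateway reloads, not in app code.
import random

ROUTING = {
    "agents-default": [
        {"provider": "deepseek",  "model": "deepseek-r2",   "weight": 0.70},  # started at 0.30
        {"provider": "anthropic", "model": "claude-sonnet", "weight": 0.30},
    ]
}

def pick_target(route: str = "agents-default") -> dict:
    """Weighted pick per request; rollback = set the R2 weight to 0, no redeploy."""
    targets = ROUTING[route]
    r, cumulative = random.random(), 0.0
    for target in targets:
        cumulative += target["weight"]
        if r <= cumulative:
            return target
    return targets[-1]
```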

The 48-hour canary period caught one prompt that broke badly on R2's tool-call format. Fixed it before it ever hit majority traffic. Would have been a production incident if we'd done a hard cutover.

Bill dropped 41.3% in the first week. Still watching quality metrics but so far no regressions on the tasks that matter.

u/Otherwise_Flan7339 — 2 days ago

Claude went down for 2 hours. Our fallback broke in a non-obvious way.

When Claude went down earlier this month, we had an onboarding agent fully dependent on it. Failover to GPT-4o was supposed to be straightforward. It wasn’t.

The issue wasn’t availability. It was prompts.

Our system prompt had been tuned over months for Claude’s instruction-following and response structure. When we switched models, ~30% of tool calls came back in the wrong format. Same logic, same tools, just a different model behavior.

Nothing broke completely, but there was enough friction to generate support tickets and manual fixes.

What we changed after:

  • maintain separate prompt variants per provider
  • test both regularly on a small eval set
  • treat prompts like deployment artifacts, not static text

We also moved to a setup where traffic can fail over automatically between providers without touching application code.
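
If it helps, here's the shape of what we ended up with, as a rough Python sketch. The provider names, prompt paths, and call_model hook are placeholders for however you actually hit your gateway; the real thing lives in config rather than hardcoded like this.

```python
# Rough sketch, not production code. Prompts are versioned per provider and the
# caller never assumes a single vendor; if one provider errors, we try the next.
PROMPTS = {
    "anthropic": "prompts/onboarding_claude_v7.txt",  # placeholder paths
    "openai":    "prompts/onboarding_gpt4o_v3.txt",
}
PROVIDER_ORDER = ["anthropic", "openai"]  # failover order

def run_agent(user_input: str, call_model):
    """call_model(provider, prompt, user_input) stands in for your gateway client."""
    last_error = None
    for provider in PROVIDER_ORDER:
        with open(PROMPTS[provider]) as f:
            prompt = f.read()
        try:
            return call_model(provider, prompt, user_input)
        except Exception as exc:  # provider down, rate limited, etc.
            last_error = exc
    raise last_error
```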

The next outage happened the day after. Similar duration. This time, we didn’t notice it at the application layer. The obvious takeaway is “don’t rely on one provider.”
The less obvious one: prompt portability is a real problem, and you only discover it when things break.

u/Otherwise_Flan7339 — 3 days ago

Anthropic confirmed their best model won't be public. 50 companies get it. We're not one of them.

Anthropic confirmed Claude Mythos (apparently their most capable model ever built) isn't going public. 50 organizations get access through a gated program called Project Glasswing. That's it.

I understand the reasoning. A model that's reportedly excellent at finding security vulnerabilities doesn't get a public API on day one. The responsible deployment argument is real.

But here's the practical impact for early-stage startups: we're now in a two-tier market. Fifty organizations get to build on capabilities the rest of us can't access. If Mythos is as capable as early reports suggest, those 50 companies have an 18-month head start on whatever product categories require that level of reasoning.

The compounding issue nobody's talking about: the organizations with Glasswing access are almost certainly large enterprises, not pre-seed startups. They'll define what the frontier model is actually used for, ship products that set user expectations, and by the time public access opens, the category leaders will be entrenched.

OpenAI went through a version of this with GPT-4 access tiers in 2023. The early-access holders didn't dominate every category, but they owned the initial product narrative.

Nothing actionable here if you're a small team; we don't have the leverage to get onto a 50-org whitelist. But if your product roadmap depends on frontier-level reasoning, it's worth acknowledging that the constraint is structural rather than just a waitlist.

u/Otherwise_Flan7339 — 3 days ago

we lost a client because our agent silently got worse and nothing in our logs caught it

we run a lead scoring agent for sales teams. takes inbound leads, enriches them, scores them 1-100, routes to the right rep. been running fine for months

three weeks ago one of our clients said their sales team felt like the leads were off. closing rates dropped from ~22% to 14%. that was not a fun call to be on

we checked everything. prompts hadn't changed. input data looked normal. no errors in the logs. the agent was still scoring leads and routing them. it just wasn't scoring them well anymore

took us almost a week to figure out what happened. anthropic had pushed some kind of update to sonnet. nothing announced, no changelog we could find. but our prompts that were tuned for the old behavior started producing slightly different score distributions. leads that used to get 75+ were coming in at 60-65. our threshold for "hot lead" was 70 so a bunch of genuinely good leads were getting routed to nurture instead of to a rep

nothing broke. no errors. everything looked fine. the model just quietly changed how it interpreted our scoring rubric and we had no way to detect that automatically

what we do now is route a copy of every scoring request through a second model and compare the outputs (we do this via bifrost + langfuse; the gateway handles routing, langfuse handles the traces). if the delta between the two suddenly shifts by more than a few points we get an alert. caught another drift last week within hours instead of weeks. in hindsight we should’ve been doing this from day one
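
rough sketch of the comparison logic if anyone wants it. this is not the bifrost or langfuse api, just the idea in plain python. score_with() and alert() stand in for whatever client and paging hook you already have, and the thresholds are made up

```python
# rough sketch, not our actual code. score_with(model, lead) returns a 1-100 score
# from whichever model you name; alert(msg) is your existing paging/Slack hook.
from collections import deque
from statistics import mean

WINDOW = 200                # rolling window of requests
BASELINE_DELTA = 4.0        # avg |primary - reference| measured during a calm period
ALERT_SHIFT = 3.0           # alert if the rolling average moves more than this

recent_deltas = deque(maxlen=WINDOW)

def score_lead(lead: dict, score_with, alert) -> int:
    primary = score_with("primary-model", lead)      # the score that drives routing
    reference = score_with("reference-model", lead)  # second model, same rubric
    recent_deltas.append(abs(primary - reference))
    if len(recent_deltas) == WINDOW and abs(mean(recent_deltas) - BASELINE_DELTA) > ALERT_SHIFT:
        alert(f"score drift: rolling delta {mean(recent_deltas):.1f} vs baseline {BASELINE_DELTA}")
    return primary
```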

the scariest part about building on hosted models isn't outages. it's silent updates that change your output distribution without telling you

u/Otherwise_Flan7339 — 6 days ago

Anthropic killed 135,000 OpenClaw integrations overnight and nobody learned the right lesson

On April 4, Anthropic revoked OAuth access for OpenClaw, a third-party tool that let developers route through Claude Pro/Max subscriptions. 135,000+ instances went dark. Developers who were paying $20/month for a subscription suddenly faced 10-50x cost increases to get the same access through the API.

Anthropic cited "outsized strain" on infrastructure. Maybe. There was also a security vulnerability disclosed for OpenClaw the week before. Either way, the result was the same: thousands of developers' workflows broke with zero notice.

The discourse has been "Anthropic bad" or "developers shouldn't have relied on subscription access for production." Both miss the actual lesson.

The lesson is that every AI provider will eventually change the economics on you. OpenAI did it when they deprecated older models. Google did it with Gemini tier restructuring. Anthropic just did it with OAuth access. The specific trigger doesn't matter. What matters is whether your architecture survives when it happens.

The startups I know that handled the OpenClaw cutoff without scrambling had one thing in common: their application code didn't know it was talking to Anthropic. It talked to an intermediate layer that routed to whatever provider was available and affordable. When Claude access changed, they adjusted the routing. No code changes, no redeployment, no customer impact.
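
In practice that looks something like the sketch below (hedged example; the URL, key, and route name are placeholders). Most gateways expose an OpenAI-compatible endpoint, so the application only ever knows about the internal layer:

```python
# Sketch of the idea: the app talks to one internal endpoint, and which vendor
# actually serves "default-chat" is a mapping in gateway config, not app code.
from openai import OpenAI  # any OpenAI-compatible client works against most gateways

client = OpenAI(
    base_url="http://llm-gateway.internal/v1",  # your intermediate layer, not a vendor
    api_key="gateway-key",                      # gateway credential, not a provider key
)

def complete(messages):
    # Swapping Anthropic for another provider means editing the gateway's
    # "default-chat" route, not touching or redeploying this function.
    return client.chat.completions.create(model="default-chat", messages=messages)
```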

The ones that scrambled had Claude's API hardcoded everywhere. They spent the week rewriting integrations instead of building product.

Build like every provider will eventually screw you. Because they will.

u/Otherwise_Flan7339 — 9 days ago

Our agent's API bill dropped 40% after we stopped calling Opus for everything

We build AI agents for sales automation. Three months ago our inference costs were climbing fast because every agent call went through Claude Opus at $15/M input tokens ($75/M output). Most of those calls were simple stuff: extracting email addresses, classifying intent, summarizing call notes. Opus is massive overkill for that.

The fix was routing different tasks to different models based on complexity. Simple extraction goes to Haiku. Summarization goes to Sonnet. Only complex multi-step reasoning actually hits Opus. Sounds obvious in hindsight but when you're shipping fast you don't think about it until the bill lands.

We handle the routing through a gateway with weighted rules per endpoint. Didn't want to build a custom router into the app because then every model change needs a code deploy. The gateway handles it in config so we can adjust routing in real time when we see costs spike.
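
For the curious, the routing table is conceptually just this (illustrative Python, not the gateway's actual config format; the task names and model tiers are ours):

```python
# Illustrative sketch: the task -> model table lives in config the gateway can
# reload, so moving a task to a cheaper tier doesn't require a code deploy.
MODEL_FOR_TASK = {
    "extract_email":   "claude-haiku",   # cheap structured extraction
    "classify_intent": "claude-haiku",
    "summarize_call":  "claude-sonnet",
    "plan_outreach":   "claude-opus",    # only multi-step reasoning justifies Opus
}

def model_for(task: str) -> str:
    # Unknown tasks fall back to the mid tier rather than the most expensive one.
    return MODEL_FOR_TASK.get(task, "claude-sonnet")
```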

Went from $4.2k/month to $2.5k with zero quality drop on the tasks that got moved to cheaper models. The trick is knowing which tasks actually need frontier-level reasoning and which ones don't.

u/Otherwise_Flan7339 — 10 days ago