r/AIcosts


Switched 70% of our agent traffic to DeepSeek R2 without a redeploy. Here's how

DeepSeek R2 came out last week with pricing roughly 70% lower than the Western frontier models we were using. For a pre-seed startup that number matters.

The problem with switching models mid-production: we had LangChain agents with prompts tuned to a specific provider's behavior. Every previous model switch meant updating config, testing, redeploying, and praying nothing broke at 2am. With 3 people on the team that's a half-day minimum.

What we did instead: route through a gateway with weighted routing config. Set R2 to handle 30% of traffic initially, watch error rates and output quality for 48 hours, then bump to 70%. No code changes. No redeploys. If R2 started producing bad outputs we could roll back in 30 seconds by changing a config value.
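A minimal sketch of what that weighted routing amounts to (the config keys and model names here are illustrative, not our actual gateway schema):

```python
import random

# Illustrative weighted-routing config: model -> share of traffic.
ROUTING_CONFIG = {
    "deepseek-r2": 0.30,       # start at 30%, bump after the watch period
    "frontier-model": 0.70,
}

def pick_model(config, rng=random):
    """Pick a model with probability proportional to its configured weight."""
    models = list(config)
    weights = [config[m] for m in models]
    return rng.choices(models, weights=weights, k=1)[0]

# Rollback is a config edit, not a deploy:
# ROUTING_CONFIG = {"deepseek-r2": 0.0, "frontier-model": 1.0}
```

The point is that the split lives in config the gateway reads at request time, so moving from 30% to 70% is a one-line change.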

The 48-hour shadow period caught one prompt that broke badly on R2's tool-call format. Fixed it before it ever hit majority traffic. Would have been a production incident if we'd done a hard cutover.

Bill dropped 41.3% in the first week. Still watching quality metrics but so far no regressions on the tasks that matter.

reddit.com
u/Otherwise_Flan7339 — 2 days ago

Anthropic confirmed their best model won't be public. 50 companies get it. We're not one of them.

Anthropic confirmed Claude Mythos (apparently their most capable model ever built) isn't going public. 50 organizations get access through a gated program called Project Glasswing. That's it.

I understand the reasoning. A model that's reportedly excellent at finding security vulnerabilities doesn't get a public API on day one. The responsible deployment argument is real.

But here's the practical impact for early-stage startups: we're now in a two-tier market. Fifty organizations get to build on capabilities the rest of us can't access. If Mythos is as capable as early reports suggest, those 50 companies have an 18-month head start on whatever product categories require that level of reasoning.

The compounding question nobody's talking about: the organizations with Glasswing access are almost certainly large enterprises, not pre-seed startups. They'll define what the frontier model is actually used for, ship products that set user expectations, and by the time public access opens, the category leaders will be entrenched.

OpenAI went through a version of this with GPT-4 access tiers in 2023. The early-access holders didn't dominate every category, but they owned the initial product narrative.

Nothing actionable here if you're a small team; we don't have the leverage to get into a 50-org whitelist. But if your product roadmap depends on frontier-level reasoning, worth acknowledging that the constraint is structural rather than just a waitlist.

u/Otherwise_Flan7339 — 3 days ago

LLM API prices dropped ~80% since 2024. How are you updating your production cost models?

Input token costs across major providers have dropped roughly 80% in two years. Gemini Flash-Lite is now $0.25 per million input tokens. Models that cost $15/MTok in early 2024 are at $1-3 now.

The obvious effect is that things that were cost-prohibitive in 2024 are cheap today. But the less obvious effect is that your cost models from 18 months ago are wrong in ways that affect architectural decisions.

Specifically: a lot of teams built caching layers, batching pipelines, and model tiering strategies because frontier model costs forced it. Those decisions made sense at $15/MTok. At $1/MTok some of them are still worth it, and some of them are complexity you're maintaining for a cost that no longer exists.

We had a semantic caching layer that was saving us about $340/month at 2024 prices. At current prices that's $47/month. The engineering time to maintain it is worth more than $47/month. Ripped it out.
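The break-even math is trivial but worth writing down; a sketch with the numbers from above (the $100/hr figure is an assumed loaded engineering rate, not from our books):

```python
def cache_still_worth_it(monthly_savings, maintenance_hours, hourly_rate):
    """A cost optimization pays for itself only if the dollars it saves
    exceed the cost of the engineering time spent maintaining it."""
    return monthly_savings > maintenance_hours * hourly_rate

# At 2024 prices the cache saved ~$340/month; at current prices ~$47/month.
# Even one hour a month of maintenance at an assumed $100/hr flips the call.
assert cache_still_worth_it(340, 1, 100) is True
assert cache_still_worth_it(47, 1, 100) is False
```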

Conversely, we had batch jobs that used cheap models because the frontier was too expensive for non-critical tasks. That price signal is gone. Now we're running frontier models on more tasks and getting better output quality at the same cost.

The broader point: cost-driven architectural decisions have a shelf life. Worth auditing which ones were built for a price environment that no longer exists.

u/llamacoded — 2 days ago

Built a fully automated tax system with Claude; curious about real costs + ROI

Saw a post where someone used Claude to build a fully automated tax + accounting pipeline. Not chat. Actual Python code, workflows, and a structured tax knowledge base.

It handled:

  1. bank ingestion across accounts
  2. receipt matching via email scraping
  3. expense categorization
  4. VAT + reporting outputs
  5. edge case detection for accountant review

What’s interesting is the cost vs value discussion.

From what I found:

  • Typical AI accounting tools run about $200–600/month
  • Custom systems can host for under $50/month after build
  • ROI can hit 3–5x in year one due to time savings

But real-world builds seem more nuanced. One Reddit breakdown estimated:

> “Single agent: $25–50/month… 10 agents: $500–2,000/month”

So it feels like:

  • cheap if you DIY + control infra
  • expensive if you scale agents or use SaaS layers
  • unclear long-term maintenance cost

The big shift here is this:

AI is not doing the taxes. It is replacing the data plumbing and workflow layer.

So the real question:

Where is the actual breakeven point?

  • At what point does this beat a $1–3k/year accountant?
  • How much hidden cost is in maintenance, updates, edge cases?
  • Are people underestimating system complexity?

Feels like we are moving from “AI tool cost” to “AI system cost,” and most people are still pricing it wrong.

u/PopPsychological1218 — 4 days ago

we lost a client because our agent silently got worse and nothing in our logs caught it

we run a lead scoring agent for sales teams. takes inbound leads, enriches them, scores them 1-100, routes to the right rep. been running fine for months

three weeks ago one of our clients said their sales team felt like the leads were off. closing rates dropped from ~22% to 14%. that was not a fun call to be on

we checked everything. prompts hadn't changed. input data looked normal. no errors in the logs. the agent was still scoring leads and routing them. it just wasn't scoring them well anymore

took us almost a week to figure out what happened. anthropic had pushed some kind of update to sonnet. nothing announced, no changelog we could find. but our prompts that were tuned for the old behavior started producing slightly different score distributions. leads that used to get 75+ were coming in at 60-65. our threshold for "hot lead" was 70 so a bunch of genuinely good leads were getting routed to nurture instead of to a rep

nothing broke. no errors. everything looked fine. the model just quietly changed how it interpreted our scoring rubric and we had no way to detect that automatically

what we do now is route a copy of every scoring request through a second model and compare the outputs. if the delta between the two suddenly changes by more than a few points we get an alert. caught another drift last week within hours instead of weeks. in hindsight we should’ve been doing this from day one
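a rough sketch of the comparison logic, in the spirit of what we run (window size and threshold here are made-up numbers, not our production values):

```python
from collections import deque
from statistics import mean

class DriftMonitor:
    """Track the rolling delta between two models' scores and alert when
    that delta shifts away from its baseline by more than `threshold`."""

    def __init__(self, window=200, threshold=5.0):
        self.deltas = deque(maxlen=window)
        self.threshold = threshold
        self.baseline = None

    def record(self, primary_score, secondary_score):
        """Returns True when the rolling delta has drifted past the threshold."""
        self.deltas.append(primary_score - secondary_score)
        if len(self.deltas) < self.deltas.maxlen:
            return False              # still warming up
        current = mean(self.deltas)
        if self.baseline is None:
            self.baseline = current   # freeze the first full window as baseline
            return False
        return abs(current - self.baseline) > self.threshold
```

the key property is that it doesn't care what the "right" score is, only that the gap between the two models stays stable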

the scariest part about building on hosted models isn't outages. it's silent updates that change your output distribution without telling you

u/Otherwise_Flan7339 — 6 days ago

Our agent's API bill dropped 40% after we stopped calling Opus for everything

We build AI agents for sales automation. Three months ago our inference costs were climbing fast because every agent call went through Claude Opus at $15/MTok input. Most of those calls were simple stuff: extracting email addresses, classifying intent, summarizing call notes. Opus is massive overkill for that.

The fix was routing different tasks to different models based on complexity. Simple extraction goes to Haiku. Summarization goes to Sonnet. Only complex multi-step reasoning actually hits Opus. Sounds obvious in hindsight but when you're shipping fast you don't think about it until the bill lands.

We handle the routing through a gateway with weighted rules per endpoint. Didn't want to build a custom router into the app because then every model change needs a code deploy. The gateway handles it in config so we can adjust routing in real time when we see costs spike.
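A stripped-down version of the routing table (endpoint names are made up; the tier names are Anthropic's):

```python
# Illustrative task-to-tier routing; in our setup this lives in gateway
# config, not application code.
ROUTES = {
    "extract_email": "claude-haiku",
    "classify_intent": "claude-haiku",
    "summarize_call": "claude-sonnet",
    "multi_step_reasoning": "claude-opus",
}

def route(task_type, default="claude-opus"):
    """Unknown task types fall back to the strongest model, so a
    misconfigured route degrades cost rather than quality."""
    return ROUTES.get(task_type, default)
```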

Went from $4.2k/month to $2.5k with zero quality drop on the tasks that got moved to cheaper models. The trick is knowing which tasks actually need frontier-level reasoning and which ones don't.

u/Otherwise_Flan7339 — 10 days ago

Switching LLM providers mid-deployment doubled our monthly bill before we caught it

We switched from Claude 3.5 to a newer model mid-sprint to take advantage of better throughput on our summarisation workload. The bill jumped from $3,840 to $7,200 in 11 days before anyone noticed.

The obvious suspect was token count differences, and yeah that was part of it, but the actual driver was that our retry logic was tuned for the old model's p99 latency profile. The new model was faster on average but had worse tail latency under load, so retries were firing at a threshold that made sense before and was completely wrong now.

We'd never looked at retry-attributed spend as its own line item, just total token cost. Turns out retry storms can quietly double your bill while all your success metrics look fine.
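If you want this visibility in your own stack, the minimum viable version is just tagging each call with its attempt number and summing the two buckets separately (the field names here are illustrative):

```python
def retry_attributed_spend(calls):
    """Split total spend into first-attempt cost vs retry cost.
    `calls` is a list of dicts with 'attempt' (1 = first try) and 'cost_usd'."""
    first = sum(c["cost_usd"] for c in calls if c["attempt"] == 1)
    retries = sum(c["cost_usd"] for c in calls if c["attempt"] > 1)
    return {"first_attempt": first, "retries": retries}
```

Once retries are their own line item, a retry storm shows up as a spike in one number instead of hiding inside total token cost.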

u/clairenguyen_ops — 5 hours ago

The hidden cost of LLM model deprecations is eval reproducibility, not config changes

Claude 3 Haiku is being retired this month. Third model deprecation in our stack this year. Saw a post about this on r/MachineLearning and all the comments were "just use config aliases, this is a skill issue." They're not wrong about the config part. But config is maybe 5% of the actual work.

Here's what actually happens when a model gets deprecated in a production ML system:

Your regression test suite breaks. Every golden example was generated by the old model. The new model produces different outputs that are probably fine but you need a human to review each one and confirm. For us that's about 200 test cases. Two days of an engineer's time just reviewing diffs.
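The review itself can at least be triaged mechanically: pair each golden output with the new model's output and surface only the non-identical cases for a human. A sketch (the data shapes are assumptions, not our actual harness):

```python
import difflib

def diff_golden_examples(golden, new_outputs):
    """Return (case_id, unified_diff) pairs a human needs to review:
    every golden case where the new model's output differs."""
    to_review = []
    for case_id, old in golden.items():
        new = new_outputs.get(case_id, "")
        if old != new:
            diff = "\n".join(difflib.unified_diff(
                old.splitlines(), new.splitlines(), lineterm=""))
            to_review.append((case_id, diff))
    return to_review
```

That doesn't take the human out of the loop, but it turns 200 reviews into however many actually changed.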

Your experiment history becomes unverifiable. We have 4 months of A/B test results comparing prompt strategies. All evaluated against old Haiku. Those results cannot be reproduced anymore. The model is gone. If a new hire asks why we chose approach B, the honest answer is "it tested better on a model that no longer exists."

Your cost projections break. New Haiku has different pricing. Every cost model, budget alert, and capacity plan that referenced old Haiku's pricing needs to be updated. Our finance team had already locked in Q2 projections based on old pricing.

Your monitoring baselines shift. Latency percentiles, token counts, error rates are all different for the new model. Every alerting threshold needs to be recalibrated or you get a week of false alarms.

The config change takes 5 minutes. The downstream cleanup takes a week. And this happens 2-3 times a year per provider now. Nobody has good tooling for this because the MLOps ecosystem still treats model selection as a one-time decision rather than an ongoing operational concern.

u/Character-File-6003 — 9 days ago

Anthropic killed 135,000 OpenClaw integrations overnight and nobody learned the right lesson

On April 4, Anthropic revoked OAuth access for OpenClaw, a third-party tool that let developers route through Claude Pro/Max subscriptions. 135,000+ instances went dark. Developers who were paying $20/month for a subscription suddenly faced 10-50x cost increases to get the same access through the API.

Anthropic cited "outsized strain" on infrastructure. Maybe. There was also a security vulnerability disclosed for OpenClaw the week before. Either way, the result was the same: thousands of developers' workflows broke with zero notice.

The discourse has been "Anthropic bad" or "developers shouldn't have relied on subscription access for production." Both miss the actual lesson.

The lesson is that every AI provider will eventually change the economics on you. OpenAI did it when they deprecated older models. Google did it with Gemini tier restructuring. Anthropic just did it with OAuth access. The specific trigger doesn't matter. What matters is whether your architecture survives when it happens.

The startups I know that handled the OpenClaw cutoff without scrambling had one thing in common: their application code didn't know it was talking to Anthropic. It talked to an intermediate layer that routed to whatever provider was available and affordable. When Claude access changed, they adjusted the routing. No code changes, no redeployment, no customer impact.
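In miniature, that intermediate layer has roughly this shape (the provider callables below are stand-ins, not real SDK clients):

```python
class ModelGateway:
    """Application code calls complete(); which provider answers is config."""

    def __init__(self, providers, active):
        self.providers = providers   # name -> callable(prompt) -> completion
        self.active = active

    def complete(self, prompt):
        return self.providers[self.active](prompt)

    def switch(self, name):
        """Swapping providers is a config flip, not a code change."""
        self.active = name
```

Real versions add fallbacks, budgets, and auth, but the load-bearing idea is that `complete()` is the only surface the app ever sees.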

The ones that scrambled had Claude's API hardcoded everywhere. They spent the week rewriting integrations instead of building product.

Build like every provider will eventually screw you. Because they will.

u/Otherwise_Flan7339 — 9 days ago

The downstream cost of swapping to Claude Opus 4.6 after the March pricing drop

Last month we reran our full A/B testing suite on a fairness-sensitive recommendation model after Anthropic dropped Opus pricing by 67% and expanded context to 1M tokens. On paper it looked like a clear win: the same $5/$25 per-MTok rate as the previous generation but with better reasoning on causal estimates.

The complication hit when we compared the new eval baselines. Even after aligning on identical prompts and temperature=0, the treatment effect estimates shifted by 0.12–0.19 percentage points on our primary metric (a causal risk difference). More worrying, the variance in bootstrap confidence intervals increased 23% on subgroups with protected attributes. What looked like “better intelligence” was quietly changing our fairness audit outcomes. Re-running the entire 14-day experiment window with the old model was not feasible without burning through the newly saved budget.

We ended up building a lightweight shadow routing layer that sends 15% of production traffic to the legacy model purely for continuity checking. The config change itself took under an hour. Reconciling the diverging eval distributions is still ongoing three weeks later.
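One detail worth stealing if you build something similar: make the shadow sample deterministic per request, so the same request always does (or doesn't) go to the legacy model and the continuity comparison stays apples to apples. A sketch of the sampler (the 15% fraction is ours; everything else is illustrative):

```python
import hashlib

def shadow_sample(request_id: str, fraction: float = 0.15) -> bool:
    """Deterministically select a fixed fraction of traffic by hashing
    the request id; the same id always gets the same decision."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (h % 10_000) < fraction * 10_000
```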

The real lesson is that model upgrades in production causal pipelines are rarely just a pricing or capability story.

u/clairedoesdata — 9 days ago

Qwen 3.6-Plus, Agentic Coding, and the Causal Inference Gap

The recent release of Qwen 3.6-Plus (announced mid-May 2024, with a 1M context window and enhanced agentic coding capabilities) has naturally amplified discussions around truly autonomous agents. The excitement is palpable: the prospect of an LLM not just generating code but orchestrating complex execution pipelines, identifying errors, and self-correcting promises a significant shift in development paradigms, particularly for software engineering tasks.

However, this very autonomy introduces a subtle, yet profound, causal inference challenge that often gets overlooked. When an agent self-corrects based on an observed outcome, are we witnessing true causal reasoning, or merely sophisticated correlation mapping within its vast parameter space? My experience across thousands of A/B tests in financial tech suggests a critical distinction. A system designed to optimize for a metric often learns the what and when, not the why.

The 1M context window, while impressive for synthesizing observational data, doesn't inherently imbue the model with a counterfactual understanding. If an agent refactors code and a performance metric improves, it observed an association. It did not necessarily intervene on the true causal lever in a way that generalizes robustly outside its immediate operational context. The risk lies in attributing causal agency where only predictive excellence exists, potentially leading to brittle systems that fail when an unobserved covariate shifts. For me, the real leap will be when these agents can articulate and rigorously test specific causal hypotheses, not just optimize via iterative trial and error.

u/clairedoesdata — 7 days ago