r/AIQuality

Your agent passes benchmarks. Then a tool returns bad JSON and everything falls apart. I built an open source harness to test that locally. Ollama supported!

Most agent evals test whether an agent can solve the happy-path task.

But in practice, agents usually break somewhere else:

  • tool returns malformed JSON
  • API rate limits mid-run
  • context gets too long
  • schema changes slightly
  • retrieval quality drops
  • prompt injection slips in through context

That gap bothered me, so I built EvalMonkey.

It is an open source local harness for LLM agents that does two things:

  1. Runs your agent on standard benchmarks
  2. Re-runs those same tasks under controlled failure conditions to measure how badly performance degrades

So instead of only asking:

"Can this agent solve the task?"

you can also ask:

"What happens when reality gets messy?"

A few examples of what it can test:

  • malformed tool outputs
  • missing fields / schema drift
  • latency and rate limit behavior
  • prompt injection variants
  • long-context stress
  • retrieval corruption / noisy context

The goal is simple: help people measure reliability under stress, not just benchmark performance on clean inputs.
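EvalMonkey's actual interface will differ, but the core idea of re-running a task with a deliberately corrupted tool can be sketched in a few lines. This is an illustrative stand-in, not the project's real API; `with_fault_injection`, `get_quote`, and all parameter names here are made up for the example:

```python
import json
import random

def with_fault_injection(tool_fn, mode="malformed_json", rate=1.0, seed=0):
    """Wrap a tool so a fraction of its calls return corrupted output.

    mode: "malformed_json" truncates the JSON string so parsing fails,
          "missing_field" silently drops one top-level key (schema drift).
    """
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        raw = tool_fn(*args, **kwargs)
        if rng.random() >= rate:
            return raw  # this call is left untouched
        if mode == "malformed_json":
            # Truncate mid-string so json.loads fails downstream.
            return raw[: max(1, len(raw) // 2)]
        if mode == "missing_field":
            obj = json.loads(raw)
            if obj:
                obj.pop(next(iter(obj)))
            return json.dumps(obj)
        return raw

    return wrapped

# Example tool: pretends to look up a stock quote.
def get_quote(symbol):
    return json.dumps({"symbol": symbol, "price": 101.5})

broken_quote = with_fault_injection(get_quote, mode="missing_field")
print(broken_quote("ACME"))  # same tool, one top-level key removed
```

Running the same benchmark task through the wrapped tool instead of the real one tells you whether the agent retries, asks for clarification, or silently produces a wrong answer.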

Why I built it:
My own agent used to take three attempts to get the answer I was actually looking for :/ , or time out on 10-page documents.
I also kept seeing agents look good on polished demos and clean evals, then fail for very ordinary reasons in real workflows. I wanted a simple way to reproduce those failure modes locally, without setting up a lot of infra.

It is open source, runs locally, and is meant to be easy to plug into existing agent workflows.

Repo: https://github.com/Corbell-AI/evalmonkey Apache 2.0

Curious what breaks your agent most often in practice:
bad tool outputs, rate limits, long context, retrieval issues, or something else?

u/Busy_Weather_7064 — 12 hours ago

Anthropic confirmed their best model won't be public. 50 companies get it. We're not one of them.

Anthropic confirmed Claude Mythos (apparently their most capable model ever built) isn't going public. 50 organizations get access through a gated program called Project Glasswing. That's it.

I understand the reasoning. A model that's reportedly excellent at finding security vulnerabilities doesn't get a public API on day one. The responsible deployment argument is real.

But here's the practical impact for early-stage startups: we're now in a two-tier market. Fifty organizations get to build on capabilities the rest of us can't access. If Mythos is as capable as early reports suggest, those 50 companies have an 18-month head start on whatever product categories require that level of reasoning.

The compounding problem nobody's talking about: the organizations with Glasswing access are almost certainly large enterprises, not pre-seed startups. They'll define what the frontier model is actually used for, ship products that set user expectations, and by the time public access opens, the category leaders will be entrenched.

OpenAI went through a version of this with GPT-4 access tiers in 2023. The early-access holders didn't dominate every category, but they owned the initial product narrative.

Nothing actionable here if you're a small team; we don't have the leverage to get into a 50-org whitelist. But if your product roadmap depends on frontier-level reasoning, worth acknowledging that the constraint is structural rather than just a waitlist.

u/Otherwise_Flan7339 — 3 days ago

Who are the developers here who care about AI quality?

Something I keep running into: shipping LLM features is easy, but knowing whether they're actually good is not.

Curious how people are handling this. Do you....

  • maintain a golden dataset and re-run it on every prompt change?
  • use LLM-as-judge? If so, how do you trust the judge?
  • ship and watch user feedback?
  • something else?
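For the golden-dataset option, the simplest version is just a fixed list of cases that gets re-run on every prompt change. A minimal sketch, where `call_llm` is a stand-in for whatever client you actually use and the cases are illustrative:

```python
# Minimal golden-dataset regression check: every prompt change re-runs
# the same fixed cases and compares against expected answers.

GOLDEN = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def call_llm(prompt: str) -> str:
    # Stand-in; replace with your real model call.
    return {"2+2": "4", "capital of France": "Paris"}[prompt]

def run_golden(cases):
    """Return the list of failing cases (input, expected, got)."""
    failures = []
    for case in cases:
        got = call_llm(case["input"])
        if case["expected"].lower() not in got.lower():
            failures.append((case["input"], case["expected"], got))
    return failures

failures = run_golden(GOLDEN)
print(f"{len(GOLDEN) - len(failures)}/{len(GOLDEN)} passed")
```

Substring matching is crude; it's where LLM-as-judge usually enters the picture, with all the trust questions that raises.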

I've been going back and forth on opening a focused group chat for developers who care about this stuff. Just a place that's open to comparing notes and experiences. What do any of you think?

Regardless, super interested in how folks here are approaching AI quality.

u/ajdevrel — 3 days ago

LLM API prices dropped ~80% since 2024. How are you updating your production cost models?

Input token costs across major providers have dropped roughly 80% in two years. Gemini Flash-Lite is now $0.25 per million input tokens. Models that cost $15/MTok in early 2024 are at $1-3 now.

The obvious effect is that things that were cost-prohibitive in 2024 are cheap today. But the less obvious effect is that your cost models from 18 months ago are wrong in ways that affect architectural decisions.

Specifically: a lot of teams built caching layers, batching pipelines, and model tiering strategies because frontier model costs forced it. Those decisions made sense at $15/MTok. At $1/MTok some of them are still worth it, and some of them are complexity you're maintaining for a cost that no longer exists.

We had a semantic caching layer that was saving us about $340/month at 2024 prices. At current prices that's $47/month. The engineering time to maintain it is worth more than $47/month. Ripped it out.
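The break-even math above is worth making explicit. Using the post's $340 → $47 figures, and treating the maintenance cost as an illustrative assumption (the post doesn't give one):

```python
# Re-checking whether a cost-saving layer still pays for itself after a
# price drop. Savings figures are from the post; maintenance numbers are
# illustrative assumptions.

monthly_savings_2024 = 340.0       # $/month the cache saved at 2024 prices
price_drop_factor = 340.0 / 47.0   # token prices fell ~7x for this workload
monthly_savings_now = monthly_savings_2024 / price_drop_factor

maintenance_hours_per_month = 2    # assumed engineering upkeep
engineer_cost_per_hour = 100.0     # assumed loaded rate
monthly_maintenance_cost = maintenance_hours_per_month * engineer_cost_per_hour

print(f"cache now saves ${monthly_savings_now:.0f}/mo, "
      f"costs ${monthly_maintenance_cost:.0f}/mo to maintain")
keep_cache = monthly_savings_now > monthly_maintenance_cost
```

Under these assumptions `keep_cache` comes out false, which is the "rip it out" call the post describes. The same check applies to any layer whose only justification was the old price.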

Conversely, we had batch jobs that used cheap models because the frontier was too expensive for non-critical tasks. That price signal is gone. Now we're running frontier models on more tasks and getting better output quality at the same cost.

The broader point: cost-driven architectural decisions have a shelf life. Worth auditing which ones were built for a price environment that no longer exists.

u/llamacoded — 2 days ago

switched from liteLLM to a go based proxy, tradeoffs after a month

we were on litellm for about 6 months and it was mostly fine. the thing that eventually killed it for us was streaming latency. every request was getting maybe 5-8ms added which doesn't sound bad until you stack tool calls in a multi-turn agent and the user is sitting there watching a spinner for an extra 200ms per turn. we spent two weeks trying to optimize it and i'm still not sure if it was litellm or our setup but we couldn't get it lower. could totally be skill issue on our end tbh
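the way a few ms per request turns into a visible stall is just multiplication. figures below are from the post; the per-turn call count is an assumption chosen so the arithmetic lines up:

```python
# How small per-request proxy overhead compounds in a multi-turn agent.
overhead_ms_per_request = 6.5   # midpoint of the observed 5-8ms
requests_per_turn = 30          # assumed: model + tool calls per agent turn

extra_latency_per_turn = overhead_ms_per_request * requests_per_turn
print(f"~{extra_latency_per_turn:.0f}ms of added latency per turn")
```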

switched to bifrost which is a go proxy. latency is better but the migration took a bit of effort. we had a few provider configs that didn’t transfer cleanly and one of our test providers isn’t supported yet so we paused that integration. not a blocker for us but worth calling out

the one thing that actually surprised me was the cost logging. we could see per-request costs tagged by endpoint and that's how we found out our summarization step was doing 5 retries on failures and each retry was resending full context. was costing us roughly 3x what we thought for that step. litellm gives you cost data but it's per-provider not per-request so we never would have caught that
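back-of-envelope for the retry effect: if a fraction of requests fail and each failure triggers retries that resend the full context, the effective token spend multiplies. the failure rate below is an illustrative assumption that happens to reproduce the ~3x the post describes:

```python
def cost_multiplier(failure_rate: float, retries_per_failure: int) -> float:
    """Average full-context sends per logical request when every retry
    resends the whole context."""
    return 1 + failure_rate * retries_per_failure

# e.g. 40% of summarization calls failing, 5 full-context retries each,
# makes the step cost 3x its nominal price
print(cost_multiplier(0.4, 5))
```

per-request cost logging is what exposes this; per-provider aggregates average it away.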

that said the docs are still catching up. i had to read go source code once or twice to figure out some config options. filed issues and got responses pretty fast though so that helped

not saying everyone should switch. litellm has way more providers and if you're a python shop extending it is easy. we just had a specific latency problem and this solved it for us

u/llamacoded — 6 days ago

What actually defines high quality in AI generated visuals for you?

I have been playing around with AI generated pictures and short animations, and I keep running into the same problem: something may look great at first, but the more you look at it, the more little problems you see.

For still images it is usually things like strange textures or details that do not match up. But with motion it is even more obvious. Loops do not feel smooth, the lighting changes at random times, or parts of the frame behave differently from frame to frame.

It seems to me that quality in visual AI is not just about how sharp or real something looks; it is also about how consistent it is over time.

I want to know how other people here feel about this. Do you care more about realism, smooth motion or how well the frames fit together?

u/Puzzleheaded_Bowl_15 — 10 hours ago