Your agent passes benchmarks. Then a tool returns bad JSON and everything falls apart. I built an open source harness to test that locally. Ollama supported!
Most agent evals test whether an agent can solve the happy-path task.
But in practice, agents usually break somewhere else:
- tool returns malformed JSON
- API rate limits mid-run
- context gets too long
- schema changes slightly
- retrieval quality drops
- prompt injection slips in through context
That gap bothered me, so I built EvalMonkey.
It is an open source local harness for LLM agents that does two things:
- Runs your agent on standard benchmarks
- Re-runs those same tasks under controlled failure conditions to measure how sharply performance degrades
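The core idea is simple enough to sketch in a few lines. This is generic, illustrative Python, not EvalMonkey's actual API; `degradation`, the toy agent, and the task format are all made up for the example:

```python
# Illustrative sketch of the clean-vs-perturbed idea (not EvalMonkey's API).
from typing import Callable

def degradation(agent: Callable[[str], str],
                tasks: list[tuple[str, str]],
                perturb: Callable[[str], str]) -> float:
    """Pass-rate drop when each task input is perturbed before the agent sees it."""
    def pass_rate(transform: Callable[[str], str]) -> float:
        return sum(agent(transform(q)) == a for q, a in tasks) / len(tasks)
    return pass_rate(lambda q: q) - pass_rate(perturb)

# Toy "agent": exact lookup, so any noise breaks it completely.
kb = {"2+2": "4", "capital of France": "Paris"}
toy_agent = lambda q: kb.get(q, "unknown")
tasks = [("2+2", "4"), ("capital of France", "Paris")]

print(degradation(toy_agent, tasks, perturb=lambda q: q + " \x00"))  # 1.0
```

A score of 0.0 means the agent is unaffected by the perturbation; 1.0 means it went from solving everything to solving nothing.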
So instead of only asking:
"Can this agent solve the task?"
you can also ask:
"What happens when reality gets messy?"
A few examples of what it can test:
- malformed tool outputs
- missing fields / schema drift
- latency and rate limit behavior
- prompt injection variants
- long-context stress
- retrieval corruption / noisy context
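To make the first item concrete, here is roughly what a malformed-tool-output injector looks like. Again a hedged sketch, not EvalMonkey's code; `corrupt_json` and its mutations are hypothetical, but the failure shapes (truncation, bad quoting, non-JSON prefix chatter) are the ones agents hit in practice:

```python
import json
import random

def corrupt_json(payload: str, seed: int = 0) -> str:
    """Return a deliberately malformed variant of a JSON tool response."""
    rng = random.Random(seed)
    mutation = rng.choice([
        lambda s: s[:-1],                  # truncated: closing brace lost mid-stream
        lambda s: s.replace('"', "'", 1),  # invalid quoting
        lambda s: "Note: " + s,            # non-JSON prefix chatter before the payload
    ])
    return mutation(payload)

good = json.dumps({"temp_c": 21, "city": "Paris"})
bad = corrupt_json(good)
try:
    json.loads(bad)
except json.JSONDecodeError:
    print("agent now has to recover from a malformed tool payload")
```

Wrap a tool call with something like this and you find out fast whether your agent retries, repairs, or just crashes.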
The goal is simple: help people measure reliability under stress, not just benchmark performance on clean inputs.
Why I built it:
My own agent used to take three attempts to get the answer I was looking for :/ or time out on 10-page documents.
I also kept seeing agents look good on polished demos and clean evals, then fail for very ordinary reasons in real workflows. I wanted a simple way to reproduce those failure modes locally, without setting up a lot of infra.
It is open source, runs locally, and is meant to be easy to plug into existing agent workflows.
Repo: https://github.com/Corbell-AI/evalmonkey (Apache 2.0)
Curious what breaks your agent most often in practice:
bad tool outputs, rate limits, long context, retrieval issues, or something else?