
We built an open-source eval harness for vibe coding agents
Hey r/LLMDevs! Long story short: we figured a lot of folks are vibe coding AI agents with Claude Code, then evaluating them only at the very end, when a PR is being made. At least that was the case for some internal AI projects we're working on.
But this means problems don't get surfaced until the final step, which is validation. So we extended our open-source package so that vibe coding agents can use it as a harness during iteration, instead of afterwards.
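To make "evals during iteration" concrete, here's a rough sketch of the kind of loop a coding agent could run after each change. This is plain Python and NOT deepeval's API: `EvalCase`, `run_evals`, `toy_agent`, and the substring check are all made-up stand-ins for illustration (see the docs linked below for the real workflow).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical eval case: a prompt and a substring we expect in the output."""
    prompt: str
    expected_substring: str


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case against the agent and collect failures for the coding agent to read."""
    failures = []
    for case in cases:
        output = agent(case.prompt)
        # Toy scoring rule (an assumption): pass if the expected text appears in the output.
        if case.expected_substring.lower() not in output.lower():
            failures.append({"prompt": case.prompt, "output": output})
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def toy_agent(prompt: str) -> str:
    """Stand-in for the agent being vibe coded; a real one would call an LLM."""
    if "France" in prompt:
        return "Paris is the capital of France."
    return "I don't know."


CASES = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is the capital of Japan?", "Tokyo"),
]

# Run this after every edit instead of once at PR time.
report = run_evals(toy_agent, CASES)
print(f"pass rate: {report['pass_rate']:.0%}")
for failure in report["failures"]:
    print("FAILED:", failure["prompt"], "->", failure["output"])
```

The point is that the failure list goes back to the coding agent while it's still iterating, so it can fix the Japan case before moving on rather than finding out at the end.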
DISCLAIMER: We don't have hard benchmarks showing this works better. What we've observed so far: instead of Claude Code making changes for a solid 10 minutes and then spending another 5-10 minutes on evals, the whole process takes about the same time, except the evals now run during iteration.
Use cases we've avoided: long-running agents (evals just take too long to be incorporated into development).
We also added a bonus feature: the SKILL.md file adds tracing to your agents, which at times helps Claude Code avoid overfitting to the evals (traces are stored in local JSON files).
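For anyone wondering what "traces stored in local JSON files" could look like, here's a minimal sketch using only the standard library. It is NOT deepeval's tracing API or trace format: the `traced` decorator, the span fields, and the file name are all assumptions for illustration.

```python
import functools
import json
import tempfile
import time
from pathlib import Path

# Hypothetical trace location: a fresh temp directory so the example doesn't touch your repo.
TRACE_FILE = Path(tempfile.mkdtemp()) / "agent_trace.json"


def traced(fn):
    """Append one span (name, inputs, output, duration) to a local JSON file per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        span = {
            "name": fn.__name__,
            "args": [repr(a) for a in args],
            "kwargs": {k: repr(v) for k, v in kwargs.items()},
            "output": repr(result),
            "duration_s": round(time.perf_counter() - start, 6),
        }
        # Read-modify-write keeps the file a valid JSON list (fine for a single process).
        spans = json.loads(TRACE_FILE.read_text()) if TRACE_FILE.exists() else []
        spans.append(span)
        TRACE_FILE.write_text(json.dumps(spans, indent=2))
        return result
    return wrapper


@traced
def retrieve(query: str) -> list[str]:
    """Stand-in retrieval step of a toy agent."""
    return [f"doc about {query}"]


@traced
def answer(query: str) -> str:
    """Stand-in top-level agent step; calls the traced retrieval step."""
    docs = retrieve(query)
    return f"Answer based on {len(docs)} document(s)."


answer("eval harnesses")
print(TRACE_FILE.read_text())
```

With something like this, the coding agent can look at intermediate steps (what was retrieved, what each step returned) and not just the final eval score.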
Open source tool: https://github.com/confident-ai/deepeval
Docs for the workflow I mentioned: https://deepeval.com/docs/vibe-coding
Would you use this, given it's open-source? Why or why not?
Drop your honest feedback below!