u/sunglasses-guy

We built an open-source eval harness for vibe coding agents

Hey r/LLMDevs! Long story short, we figured a lot of folks are vibe coding AI agents with Claude Code and only evaluating them at the very end, when a PR is being made. At least this was the case for some internal AI projects we're working on.

But this also means problems don't get surfaced until the final step, which is validation. So we thought we'd extend our open-source package to let vibe coding agents use it as a harness during iteration, instead of afterwards.

DISCLAIMER: We don't have hard benchmarks to show this works better. What we've observed so far is that, instead of Claude Code making changes for a good solid 10 minutes followed by another 5-10 min of evals, the entire process takes the same amount of time, but evals run during iteration rather than at the end.

Use cases we've avoided: long-running agents (evals just take too long to be incorporated into development).

We also added a bonus feature where the SKILL.md file adds tracing to your agents, to help Claude Code avoid overfitting to evals at times (traces are stored in local JSON files).
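
For anyone wondering what "traces stored in local JSON files" could look like in practice, here's a minimal sketch using only the Python standard library. To be clear, this is not DeepEval's API or what SKILL.md actually generates; `traced` and the record fields are hypothetical names made up for illustration.

```python
import functools
import json
import time
from pathlib import Path


def traced(trace_path):
    """Hypothetical decorator: append one JSON record per call to trace_path."""
    path = Path(trace_path)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            record = {
                "name": fn.__name__,
                "input": {"args": list(args), "kwargs": kwargs},
                "output": output,
                "seconds": round(time.perf_counter() - start, 6),
            }
            # Load any existing traces, append the new one, and write back.
            traces = json.loads(path.read_text()) if path.exists() else []
            traces.append(record)
            # default=str keeps the dump from failing on non-JSON values.
            path.write_text(json.dumps(traces, indent=2, default=str))
            return output

        return wrapper

    return decorator
```

You'd wrap an agent step or tool with `@traced("traces.json")`, and a coding agent could then read the file to see what actually happened in a run instead of only seeing the final eval score.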

Open source tool: https://github.com/confident-ai/deepeval

Docs to this workflow I mentioned: https://deepeval.com/docs/vibe-coding

Would you use this given it's open-source? Why or why not?

Drop your honest feedback below!

u/sunglasses-guy — 2 days ago
▲ 8 r/deepeval+1 crossposts

Evals for AWS AgentCore

Hey r/aws! I'm one of the maintainers of DeepEval, an open-source framework to evaluate AI agents (it's like Pytest for LLMs), and I wanted to share a recent integration we released with AgentCore that you might find useful.

Long story short, we found:

  1. AgentCore to be increasingly popular with our community, and
  2. No easy way exists to test these agents without being coupled to AWS's platform

So we made evals for AgentCore 100% open-source by integrating it into DeepEval. It's literally 2 lines of code:

https://preview.redd.it/llfgtg1uww1h1.png?width=1366&format=png&auto=webp&s=f30adca0fa9e66ac6e85e5ed6e42e671a220886b

That's literally it. Under the hood, "instrument_agentcore" traces AgentCore agents, while "invoke" calls AgentCore, allowing DeepEval to capture the trace. Once we have the trace, you can simply use DeepEval's metrics for evals; in this code snippet, that's task completion.

You might also notice that we were able to use Pytest. That's because Pytest is what DeepEval wraps.

Anyway, hope this was helpful, super curious to know whether you see yourself using this integration. Not going to drop a link here for obvious reasons but, LMK if you're interested!

u/sunglasses-guy — 2 days ago

Just released DeepEval 4.0, eval harness for coding agents with 1 line integration with LangChain

Hey r/deepeval, I'm one of the maintainers of DeepEval. For those that don't know, DeepEval is an open-source evaluation framework for LLMs. Think Pytest for LLMs.

We're releasing DeepEval 4.0 today, which includes a major component that allows LangChain users to run evals on LangChain traces locally via Pytest.

https://preview.redd.it/o33w7f8euw0h1.png?width=1388&format=png&auto=webp&s=5f33fcce62285d53a560fe84ae61f1a92b7858e7

It also includes a local TUI "inspect trace" mode for those that don't want to indulge in any cloud UI such as LangSmith:

https://preview.redd.it/yrzwyq3nuw0h1.png?width=2454&format=png&auto=webp&s=091f01e89675cedd735d89843438c65ce42300e6

Why did we build this? Because we found that with vibe coding and coding agents, a local development workflow that optimizes for speed and efficiency matters now more than ever.

We're making DeepEval the evaluation harness for vibe coding agents such as Claude Code for this reason.

Hope this is interesting, and you can head to our GitHub to see the latest release!

u/sunglasses-guy — 7 days ago

👋 Welcome to r/deepeval - Introduce Yourself and Read First!

Hey everyone! I'm u/sunglasses-guy, one of the maintainers of r/deepeval.

This is our new home for all things related to evaluating AI agents. We're excited to have you join us!

What to Post
Post anything that you think the community would find interesting, helpful, or inspiring. Feel free to share your thoughts, photos, or questions about what has worked for you in testing AI, best practices, challenges you've faced, and of course, any cool open-source projects you've built around AI in general.

Community Vibe
We're all about being friendly, constructive, and inclusive. Let's build a space where everyone feels comfortable sharing and connecting.

How to Get Started

  1. Introduce yourself in the comments below.
  2. Post something today! Even a simple question can spark a great conversation.
  3. If you know someone who would love this community, invite them to join.
  4. Interested in helping out? We're always looking for new moderators, so feel free to reach out to me to apply.

Thanks for being part of the very first wave. Together, let's make r/deepeval amazing.

u/sunglasses-guy — 8 days ago
▲ 6 r/AIEval+1 crossposts

Self-reflection after 4 weeks of evals

Disclaimer: I'm just one dev sharing what I've seen so far. I might not know everything, so take what I say with a grain of salt.

We started running evals seriously about 4 weeks ago. Not just "run some metrics and look at scores" but actually trying to build a real workflow around it. Here's what I've learned so far.

Alignment took more time than the evals themselves.

This was the big one. I assumed the hard part would be picking metrics, setting up test cases, getting the infrastructure right. Nope. The hardest part was getting PMs aligned on what "good" even means.

We'd run evals, show results, and then spend hours debating whether a 0.7 on some metric was acceptable or not. PMs would disagree with how metrics scored certain outputs. "That response is fine, why did it fail?" became a recurring conversation. Looking back, we should have spent the first week purely on alignment before writing a single test case. Getting everyone to agree on what a good output looks like saves you weeks of back and forth later.

Annotations worked. When people actually did them.

When team members sat down and annotated outputs properly, the quality of our evals improved dramatically. We could calibrate metrics, catch edge cases, and actually trust our scores.
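
To make "calibrate metrics" a bit more concrete, here's a rough sketch of one way to do it: given annotated outputs, pick the pass/fail threshold that agrees most often with the human verdicts. This is my own illustration with made-up names (`Annotated`, `agreement`, `best_threshold`), not a DeepEval feature and not necessarily how our team did it.

```python
from dataclasses import dataclass


@dataclass
class Annotated:
    score: float      # metric score in [0, 1]
    human_pass: bool  # annotator's verdict on the same output


def agreement(samples, threshold):
    """Fraction of samples where (score >= threshold) matches the human verdict."""
    if not samples:
        raise ValueError("need at least one annotated sample")
    hits = sum((s.score >= threshold) == s.human_pass for s in samples)
    return hits / len(samples)


def best_threshold(samples, candidates):
    """Candidate threshold with the highest agreement (ties go to the earliest)."""
    return max(candidates, key=lambda t: agreement(samples, t))
```

With something like this, the "is 0.7 acceptable?" debate turns into "which threshold matches what annotators actually said?", which is a much easier conversation to have with PMs.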

The problem is that "when people actually did them" part. Some weeks were great. Other weeks, the annotation queue just sat there untouched. And when annotations don't happen, you're flying blind. Your metrics drift, your datasets go stale, and you lose the human signal that makes evals actually useful.

Not blocking out dedicated time was the biggest mistake.

This is probably the most practical takeaway. We just assumed people would find time to annotate, review results, and participate in the eval workflow. They didn't. Everyone has other priorities, and evals always got pushed to "I'll get to it later."

If I could restart these 4 weeks, I'd block out specific recurring time on everyone's calendar from day one. Treat it like a standup. If evals aren't scheduled, they don't happen. It's that simple.

4 weeks in and I think we're in a better spot now, but honestly most of the progress came from fixing the people and process side, not the technical side. Curious if others have had similar experiences.

u/sunglasses-guy — 11 days ago