u/Fun_Emergency_4083

Agent eval for the Hermes stack
▲ 2 r/hermesagent+1 crossposts

Built a small CLI for testing agents. Thought this sub might find it useful since we're all shipping agents that need to actually work.

The problem: you tweak a prompt, swap Hermes 3 for another model, or change a tool config, and your agent silently gets worse. You don't find out until later.

The tool freezes agent outputs as baselines. Every run is diffed against the frozen version, and new failures get flagged. It caught my model dropping from 85% to 52% on unseen cases while validation scores looked fine.
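
For anyone wondering what "freeze and diff" looks like in practice, here's a rough sketch of the general idea in plain Python. To be clear, this is not rigr's actual code or API: the function names, the JSON file, and the pass/fail-per-case format are all assumptions made up for illustration.

```python
import json
from pathlib import Path


def freeze_baseline(results: dict[str, bool], path: Path) -> None:
    """Save per-case pass/fail results as the frozen baseline (hypothetical format)."""
    path.write_text(json.dumps(results, indent=2, sort_keys=True))


def diff_against_baseline(results: dict[str, bool], path: Path) -> dict[str, list[str]]:
    """Compare a new run to the frozen baseline and bucket the differences."""
    baseline = json.loads(path.read_text())
    return {
        # Passed in the baseline, fails now: these are the ones to flag.
        "regressions": sorted(
            case for case, ok in results.items()
            if baseline.get(case) is True and not ok
        ),
        # Failed in the baseline, passes now.
        "fixed": sorted(
            case for case, ok in results.items()
            if baseline.get(case) is False and ok
        ),
        # Cases the baseline has never seen.
        "new_cases": sorted(case for case in results if case not in baseline),
    }
```

You'd call freeze_baseline once on a run you trust, then diff_against_baseline after every prompt, model, or tool change, and fail the build if "regressions" is non-empty. The point of tracking individual cases is that an aggregate score can stay flat while specific cases flip from pass to fail.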

pip install rigr

rigr init && rigr test --agent my_agent

https://github.com/Null-Phnix/rigr

Would love to hear if anyone else has hit this kind of regression with their Hermes agents.

u/Fun_Emergency_4083 — 12 hours ago