
Agent eval for the Hermes stack
Built a small CLI for testing agents. Thought this sub might find it useful since we're all shipping agents that need to actually work.
The problem: you tweak a prompt, swap from Hermes 3 to something else, or change a tool config, and your agent silently gets worse. You don't notice until much later.
The tool freezes agent outputs as baselines. Every subsequent run is diffed against the frozen version, and new failures get flagged. It caught my model dropping from 85% to 52% on unseen cases while validation scores looked fine.
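For anyone curious, the core idea (freeze outputs once, diff every later run against the frozen copy) is easy to sketch. This is NOT rigr's actual implementation, just a minimal illustration of the technique; `run_agent` is a stand-in for whatever calls your model.

```python
import json
from pathlib import Path

BASELINE = Path("baseline.json")

def run_agent(case: str) -> str:
    # Hypothetical stand-in for your real agent call
    # (e.g. a chat completion against Hermes 3).
    return case.upper()

def freeze(cases: list[str]) -> None:
    """Record current agent outputs as the frozen baseline."""
    BASELINE.write_text(
        json.dumps({c: run_agent(c) for c in cases}, indent=2)
    )

def diff(cases: list[str]) -> set[str]:
    """Return the cases whose output no longer matches the baseline."""
    frozen = json.loads(BASELINE.read_text())
    return {c for c in cases if run_agent(c) != frozen.get(c)}
```

After a prompt tweak or model swap, anything returned by `diff` is a new regression, independent of whatever aggregate score you're tracking.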
pip install rigr
rigr init && rigr test --agent my_agent
https://github.com/Null-Phnix/rigr
Would love to hear if anyone else has hit this kind of regression with their Hermes agents.
u/Fun_Emergency_4083 — 12 hours ago