
Agent eval for the Hermes stack
Built a small CLI for testing agents. Thought this sub might find it useful since we're all shipping agents that need to actually work.
The problem: you tweak a prompt, swap from Hermes 3 to something else, or change a tool config, and your agent silently gets worse. You don't notice until much later.
The tool freezes agent outputs as baselines. Every subsequent run is diffed against the frozen version, and new failures get flagged. It caught my model dropping from 85% to 52% on unseen cases while validation scores looked fine.
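For anyone curious, the core idea (freeze outputs once, diff every later run against the frozen copy) is easy to sketch. This is NOT rigr's actual implementation, just a minimal illustration of the technique; `run_agent` is a stand-in for whatever calls your model.

```python
import json
from pathlib import Path

BASELINE = Path("baseline.json")

def run_agent(case: str) -> str:
    # Hypothetical stand-in for your real agent call
    # (e.g. a chat completion against Hermes 3).
    return case.upper()

def freeze(cases: list[str]) -> None:
    """Record current agent outputs as the frozen baseline."""
    BASELINE.write_text(
        json.dumps({c: run_agent(c) for c in cases}, indent=2)
    )

def diff(cases: list[str]) -> set[str]:
    """Return the cases whose output no longer matches the baseline."""
    frozen = json.loads(BASELINE.read_text())
    return {c for c in cases if run_agent(c) != frozen.get(c)}
```

After a prompt tweak or model swap, anything returned by `diff` is a new regression, independent of whatever aggregate score you're tracking.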
pip install rigr
rigr init && rigr test --agent my_agent
https://github.com/Null-Phnix/rigr
Would love to hear if anyone else has hit this kind of regression with their Hermes agents.
u/Fun_Emergency_4083 — 12 hours ago