I’m building a couple of agent workflows right now and every time something breaks I’m basically the one who has to jump in and figure it out 😞
No SRE, no “let’s look into this later”. It’s just me opening traces and trying to make sense of what happened while everything else is on fire.
And it’s always the same loop: open traces -> scroll -> try to guess whether it’s retrieval, a tool call, or the prompt doing something weird. Meanwhile you’re just sitting there thinking “why is this different from the last run?”
The worst cases are when nothing actually fails. Everything looks “fine” in the trace, but:
- retrieval returned empty or garbage
- tool call technically worked but with wrong inputs
- or the agent just took a completely different path for no obvious reason
Same input, same code… different behavior 😅
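To make the "nothing failed but it's still wrong" cases concrete, this is roughly the kind of check I mean. It's only a sketch: the `Step` shape, the `required_args` map, and the `min_score` threshold are all made up for illustration and don't come from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    # Hypothetical trace record; real tracing tools will use a different shape.
    kind: str                      # "retrieval", "tool", or "llm"
    name: str
    inputs: dict[str, Any] = field(default_factory=dict)
    output: Any = None


def silent_failures(steps, required_args, min_score=0.3):
    """Return warnings for steps that 'succeeded' but look suspicious."""
    warnings = []
    for i, step in enumerate(steps):
        if step.kind == "retrieval":
            docs = step.output or []
            if not docs:
                warnings.append(f"step {i} ({step.name}): retrieval returned nothing")
            elif max(d.get("score", 0.0) for d in docs) < min_score:
                warnings.append(f"step {i} ({step.name}): best retrieval score below {min_score}")
        elif step.kind == "tool":
            # Flag required arguments that are absent or empty.
            missing = [
                arg for arg in required_args.get(step.name, [])
                if step.inputs.get(arg) in (None, "")
            ]
            if missing:
                warnings.append(f"step {i} ({step.name}): missing or empty args {missing}")
    return warnings


steps = [
    Step("retrieval", "search_docs", {"query": "refund policy"}, []),
    Step("tool", "send_email", {"to": "", "body": "hi"}, "ok"),
]
for warning in silent_failures(steps, {"send_email": ["to", "body"]}):
    print(warning)
```

Something like this would at least turn "everything looks fine" into a short list of suspicious steps, though it obviously can't catch a tool call whose inputs are present but semantically wrong.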
We’re a small team so there’s no one dedicated to this, and honestly we don’t have time to set up a proper observability stack either. We just want something that works and lets us move on.
But right now it feels like every time something breaks I’m the idiot sweating in front of traces trying to debug it while everyone else moves on.
I’ve tried replaying runs, adding logs, etc. but it still feels like guesswork most of the time.
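The closest I've gotten to something less guessy is diffing a good run against a bad one and looking at the first step where they split. A minimal sketch, assuming each run can be flattened into a list of `(step_name, inputs)` tuples (that format is my own assumption, not what any tracing tool exports):

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return (index, step_a, step_b) for the first differing step, or None."""
    # zip_longest pads the shorter run with None, so a missing step counts as a difference.
    for i, (a, b) in enumerate(zip_longest(run_a, run_b)):
        if a != b:
            return i, a, b
    return None


good_run = [
    ("retrieve", {"query": "refund policy"}),
    ("call_tool", {"name": "lookup_order", "order_id": "123"}),
    ("answer", {}),
]
bad_run = [
    ("retrieve", {"query": "refund policy"}),
    ("answer", {}),
]
print(first_divergence(good_run, bad_run))
```

This only tells me where two runs diverge, not why, so it narrows the search rather than replacing it.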
How are people actually dealing with this? Are you setting up proper monitoring for agents, or just debugging things when they break?