The agent worked perfectly in testing, completely fell apart in its first week in production, and the reason was embarrassingly obvious in hindsight.
What I had built was a monitoring and triage agent. It was supposed to watch a source, identify relevant items, score them, and route the high-intent ones to a Slack channel for a human to action. Clean loop on paper. Three tools, clear handoffs, straightforward enough.
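The shape of that loop, as originally built, can be sketched roughly like this. All the names here are hypothetical stand-ins for the actual tools; the point is the structure: score, compare against a threshold, route, with nothing in between.

```python
# Rough sketch of the original loop (function names are hypothetical).
# Each scored result flows straight into routing with no validation step.

def run_loop(source_items, score_item, route_to_slack, threshold=0.6):
    """Score each item and route high-intent ones to Slack."""
    routed = []
    for item in source_items:
        result = score_item(item)          # assumed to return {"score": float, ...}
        if result["score"] >= threshold:   # silently breaks if the shape drifts
            route_to_slack(result)
            routed.append(result)
    return routed
```

Note the fragile assumption baked into the comparison: routing trusts that `result` is always a dict with a numeric `"score"` key, which is exactly the handoff that gave way in production.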
The failure point was the scoring step. In testing I had been feeding it clean, well-formatted inputs. In production the real-world data was messier than I expected, and the scoring tool was returning inconsistent outputs that the next step in the loop could not reliably parse. Instead of failing loudly, it just kept running and quietly routing garbage downstream.
Two things fixed it. First, I added an output validation step between scoring and routing so malformed results got flagged instead of passed through. Second, I built a dead-letter channel in Slack where anything that failed validation landed for manual review instead of disappearing.
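A minimal sketch of both fixes together, assuming the scorer hands back a dict and using hypothetical posting helpers in place of the real Slack calls:

```python
# Sketch of the validation gate plus dead-letter path (all names hypothetical).
# Malformed scoring output is diverted for review instead of passed downstream.

def validate_score(result):
    """Return True only if the scoring output has the shape routing expects."""
    return (
        isinstance(result, dict)
        and isinstance(result.get("score"), (int, float))
        and 0.0 <= result["score"] <= 1.0
        and isinstance(result.get("item_id"), str)
    )

def route(result, post_to_main, post_to_dead_letter, threshold=0.8):
    """Send valid high-intent results to the main channel; anything
    malformed goes to the dead-letter channel for a human to look at."""
    if not validate_score(result):
        post_to_dead_letter(result)  # visible failure, not a silent drop
        return "dead_letter"
    if result["score"] >= threshold:
        post_to_main(result)
        return "routed"
    return "skipped"
```

The design choice worth calling out: validation failure is a distinct outcome with its own destination, not an exception that kills the loop and not a pass-through. The loop keeps running on good items while bad ones accumulate somewhere a human will actually see them.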
Sounds basic, but I had not thought carefully enough about what graceful degradation looks like in a live loop versus a clean test environment.
The honest lesson is that agents break at the handoff layer far more often than at the tool layer. The individual tools were fine. The assumptions about what one tool would hand to the next were not.
Anyone else found the handoff layer to be where most production failures actually live?