
been sitting on this for a while but genuinely don't think enough people are talking about it
we spent months tweaking prompts trying to get our agent pipeline to stop dropping steps. better instructions, few-shot examples, chain-of-thought, the whole playbook. marginal gains at best.
then someone on the team did the math. 10-step workflow, 90% per-step accuracy. sounds fine right? compounded, that's a ~35% end-to-end success rate (0.9^10). worse odds than a coin flip on every run, and we were blaming the model.
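the math, if you want to depress yourself (assuming per-step failures are roughly independent, which is generous):

```python
# per-step accuracy compounds multiplicatively across the workflow
for steps in (5, 10, 20):
    print(steps, "steps ->", round(0.9 ** steps, 3))
# 5 steps -> 0.59
# 10 steps -> 0.349
# 20 steps -> 0.122
```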
the problem isn't the model. it's that we were asking a probabilistic system to behave like a deterministic one, and just... hoping.
what actually moved the needle for us was treating it more like a software control problem than a prompting problem. a few things specifically:
locking the planning phase. stop letting the LLM decide what to do next mid-run. codify your phases, track state in a database, and let the model execute within those rails. hallucinated plans basically disappear.
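for us that ended up being a fixed transition table plus durable state. minimal sketch, assuming a `store` wrapper around the database and an `agents` map of phase -> LLM call (both names are made up):

```python
from enum import Enum

class Phase(str, Enum):
    FETCH = "fetch"
    ANALYZE = "analyze"
    DRAFT = "draft"
    REVIEW = "review"

# fixed transition table, code decides what comes next, never the model
NEXT = {Phase.FETCH: Phase.ANALYZE, Phase.ANALYZE: Phase.DRAFT, Phase.DRAFT: Phase.REVIEW}

def run(run_id: str, store, agents):
    """store persists phase/output per run; agents maps each phase to a model call."""
    phase = store.load_phase(run_id) or Phase.FETCH
    while True:
        output = agents[phase](store.load_context(run_id))  # model works inside the phase
        store.save_output(run_id, phase, output)             # durable state, resumable after a crash
        if phase not in NEXT:
            return output
        phase = NEXT[phase]
        store.save_phase(run_id, phase)
```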
splitting context per subagent. one agent trying to do web search + data analysis + code generation in a single context window is a disaster. isolated subagents with clean handoffs cut our error rate significantly and weirdly also reduced costs (less weird in hindsight: each call carries a much smaller context, so fewer tokens).
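the part that actually mattered was making the handoff structured instead of "here's my whole transcript". roughly this shape (`llm.complete` is a stand-in for whatever client you use, not a real API):

```python
from dataclasses import dataclass, field

@dataclass
class SearchHandoff:
    # the only thing the search subagent passes downstream: no raw transcripts, no tool noise
    query: str
    summary: str
    sources: list[str] = field(default_factory=list)

def run_search(topic: str, llm) -> SearchHandoff:
    # fresh context: only the topic goes in, only the structured handoff comes out
    summary = llm.complete(f"research '{topic}' and summarize the key facts with sources")
    return SearchHandoff(query=topic, summary=summary)

def run_analysis(handoff: SearchHandoff, llm) -> str:
    # never sees the search agent's scratch work, just the handoff fields
    return llm.complete(f"analyze these findings and list the risks:\n{handoff.summary}")
```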
validation at boundaries, not just at the end. we were doing a final output check. what we needed was programmatic checks at every step transition, with hard stops before anything consequential happened. sounds obvious in retrospect.
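and the checks themselves are boring, plain code between transitions, which is the point. hard stop, not retry-and-hope (sketch only, our real checks are domain-specific):

```python
class BoundaryError(Exception):
    """raised at a step transition; the run halts instead of limping forward."""

def check_search_handoff(h) -> None:
    # programmatic gate between search -> analysis, runs before anything consequential
    if not h.summary.strip():
        raise BoundaryError("empty summary out of the search step")
    if not h.sources:
        raise BoundaryError("summary with zero sources, probably hallucinated")

def guarded(handoff, check, next_step):
    check(handoff)            # hard stop here, not a warning buried in logs
    return next_step(handoff)
```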
none of this is magic. it's just software engineering applied to a layer we kept treating as a black box.
curious if others have hit the same wall. what did you end up doing — did you go full state machine or find some middle ground that was less painful to maintain?