I stopped blaming the model. The bug was in my tool set the whole time.
I had a support bot that I was pretty proud of. Good prompt, solid tool setup, handled most tickets fine. Then one night it burned through $47 in API costs on a single ticket.
A refund request. Customer says their order never arrived. Simple enough, right?
I pulled the logs the next morning expecting to find the usual stuff: a hallucinating model, a bad prompt, maybe a parsing error. But the logs were useless. Just hundreds of lines of "API call made" with no context. So I rebuilt the trace manually by matching timestamps to API calls. Took me about 3 hours to piece together what actually happened.
And when I finally saw the full picture, I just sat there for a minute, because the model wasn't wrong about anything.
It checked the order. Delivered. It checked with the carrier. Confirmed. It pulled the full order history. Clean. It verified the delivery address matched the account. It did. Every single check came back the same way. The model understood the situation perfectly.
But then it needed to actually DO something. Approve the refund, deny it, escalate to a human. And it looked at its available tools and there was nothing there. I gave it check_order and verify_shipment. That's it. No resolve_ticket. No escalate. No deny_refund.
So it did the only thing it could do with the tools it had. It kept checking. Over and over. Not because it was stuck in a dumb loop but because checking was literally the only action available to it. It had eyes but no hands.
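For context, the broken tool set looked roughly like this. The schemas below are a reconstruction, not my exact production definitions, but the shape is accurate: two read-only lookup tools in the common JSON-schema tool format, and nothing that can actually act.

```python
# A sketch of the broken setup: both tools are "eyes" (read-only
# lookups). There is no tool that can close, deny, or escalate a
# ticket, so checking again is the only legal move.
TOOLS = [
    {
        "name": "check_order",
        "description": "Look up an order and its current status.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "verify_shipment",
        "description": "Confirm delivery status with the carrier.",
        "parameters": {
            "type": "object",
            "properties": {"tracking_id": {"type": "string"}},
            "required": ["tracking_id"],
        },
    },
]

# Count the tools that could end the ticket. Spoiler: zero.
TERMINAL = {"resolve_ticket", "escalate", "deny_refund"}
action_tools = [t for t in TOOLS if t["name"] in TERMINAL]
print(len(action_tools))  # 0 -- eyes but no hands
```

Every tool in that list returns information and changes nothing. From the model's point of view, looping was rational.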
I added a resolve_ticket tool the next morning. Agent worked perfectly on the first try. Thirty seconds to fix a problem that cost me $47 and 3 hours of debugging.
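The fix was one tool definition. Roughly what I added (modeling the three outcomes as an enum on a single tool was my design choice; you could just as well expose three separate tools):

```python
# Sketch of the terminal tool. One tool, three outcomes, so the
# model can always end the loop with an explicit decision.
RESOLVE_TICKET = {
    "name": "resolve_ticket",
    "description": (
        "Close the ticket with a decision. Call this once the "
        "investigation is done."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["approve_refund", "deny_refund", "escalate_to_human"],
            },
            "reason": {"type": "string"},
        },
        "required": ["action", "reason"],
    },
}
```

Requiring a `reason` alongside the action also gave me something useful in the logs, which were part of the original problem.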
This completely changed how I build agents. I used to spend most of my time on the prompt. Getting the instructions right, tweaking the system message, adjusting temperature. Now the first thing I do before any of that is sit down and ask one question: can this agent actually FINISH the job with the tools I gave it?
Not "can it understand the task." Not "can it reason about the problem." Can it close the loop? Because if the answer is no, you'll end up with an agent that's incredibly smart and completely helpless.
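These days I run a dumb sanity check before deploying: tag every tool as read or write, and refuse to ship an agent whose tools are all read-only. A sketch, where the tagging convention is my own, not any framework's:

```python
# Pre-deploy sanity check: an agent whose tools are all read-only
# can observe forever but can never finish the task.
READ, WRITE = "read", "write"

def can_close_the_loop(tools: dict) -> bool:
    """tools maps tool name -> READ or WRITE.
    True if at least one tool can change state or end the task."""
    return WRITE in tools.values()

# The broken tool set from the refund ticket:
before = {"check_order": READ, "verify_shipment": READ}
# After adding the terminal action:
after = {**before, "resolve_ticket": WRITE}

print(can_close_the_loop(before))  # False -- all eyes, no hands
print(can_close_the_loop(after))   # True
```

Thirty lines of boilerplate, but it would have caught my $47 bug before a single API call.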
I'm curious how other people think about this. Should the model be smart enough to recognize "I don't have the right tools for this" and just stop? Or is that always on us to make sure the tool set is complete before deploying?