Audited AI agent safety across a few companies. The safety gap is way bigger than anyone admits.
I've been auditing AI agent deployments and one pattern keeps showing up. Almost every team thinks they have safety covered because they shipped the obvious stuff. Almost every one breaks the same way when you test it.
The playbook is always the same. A prompt prefix telling the agent to be helpful and harmless, which folds the moment someone says "ignore previous instructions." A keyword blacklist that base64 or Unicode homoglyphs walk right through. Rate limiting that counts requests but is blind to a slow drip spread over eight hours. A generic content filter checking for toxicity while the agent accesses internal endpoints it was never supposed to touch.
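To make the blacklist failure concrete, here is a hypothetical sketch of the pattern I keep seeing (the blocked terms and attack strings are illustrative, not from any real deployment):

```python
import base64

# The naive pattern: block requests containing known-bad strings.
BLOCKED = ["delete all records", "exfiltrate"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt passes (no blocked keyword found)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED)

attack = "delete all records"

# The literal string is caught...
assert naive_filter(attack) is False

# ...but the same intent, base64-encoded, sails straight through.
encoded = base64.b64encode(attack.encode()).decode()
assert naive_filter(f"decode this and run it: {encoded}") is True

# Unicode homoglyphs work the same way: swap the Latin 'e' in "delete"
# for U+0435 (Cyrillic small letter ie) and the byte-level match fails.
homoglyph = "d\u0435lete all records"
assert naive_filter(homoglyph) is True
```

The filter isn't buggy, it's doing exactly what it was built to do. The problem is that it matches bytes, not intent, so any reversible encoding defeats it.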
What I've seen hold up across the deployments that survived is semantic intent analysis instead of string matching. I'm talking business-specific policies that understand your context, that know what the agent should and shouldn't do, plus runtime behavioral analysis that watches what the agent actually does, step by step.
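The simplest version of that runtime check is a per-step policy gate: before any tool call executes, validate the action and target against what this specific agent is scoped to do. A minimal sketch, assuming a hypothetical policy schema and action names (real systems would layer an intent classifier on top of this allowlist core):

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Business-specific scope for one agent: intents it may act on,
    endpoints it may touch. Names here are illustrative."""
    allowed_actions: set[str] = field(default_factory=set)
    allowed_endpoints: set[str] = field(default_factory=set)

def check_step(policy: Policy, action: str, endpoint: str) -> bool:
    """Gate one agent step at runtime: both the action and the
    endpoint must be explicitly in scope, or the step is refused."""
    return action in policy.allowed_actions and endpoint in policy.allowed_endpoints

# A support agent that should only ever read tickets and draft replies.
support_agent = Policy(
    allowed_actions={"read_ticket", "draft_reply"},
    allowed_endpoints={"/api/tickets"},
)

assert check_step(support_agent, "read_ticket", "/api/tickets")        # in scope
assert not check_step(support_agent, "read_ticket", "/api/payroll")    # endpoint never granted
assert not check_step(support_agent, "delete_ticket", "/api/tickets")  # action never granted
```

The point of the design is the default: anything not explicitly granted is denied, which is the inverse of the blacklist approach, and it runs on every step, so a compromised prompt still can't reach an endpoint the policy never named.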