u/GermanBusinessInside

OWASP ranks prompt injection #1 in their LLM Top 10, but in most orgs I talk to the defense strategy is still either "we'll deal with it later" or a few regex patterns.

Now that agents are getting access to real systems — customer databases, code execution, internal tools — the attack surface is fundamentally different from a chatbot that can only generate text. An indirect injection in a retrieved document can trigger tool calls, exfiltrate data, or pivot to other agents in a multi-agent setup.

I'm curious how security teams here are actually approaching this:

  • Are you treating LLM inputs as untrusted the same way you'd treat user input in a web app?
  • Is there a classification/scanning layer in front of your agents, or are you relying on the model's own guardrails?
  • For multi-agent systems: are you scanning agent-to-agent messages, or is that assumed safe?
  • How do you handle the false positive problem? "Ignore all previous instructions" is an attack in a banking app but legitimate in a D&D game.

I've been working on this problem for a while (built a classifier specifically for this) and the context-dependent nature of prompt injection is what makes it fundamentally harder than traditional input validation. Same input, completely different risk depending on the application context.

Would love to hear what's working and what's not in practice.

reddit.com
u/GermanBusinessInside — 17 days ago