u/middleNameIsHadrian

How are people actually defending tool-using agents against indirect prompt injection?

Disclosure first: I wrote the original experiment up for ShiftMag (I'll leave a link in the comments). Part of my day job is threat intelligence.

Last weekend I wired an AI agent to my Gmail through gog, planted a few phishing emails with prompt injection instructions hidden in the body, and asked the agent to triage today's inbox.

Results:

  • Frontier model caught it, named the hidden instructions, and refused to act on them
  • Mid-tier was… unstable. One run caught it. One followed the hidden instruction. One returned a summary that quietly skipped the suspicious part.
  • Cheap model complied silently. Forwarded the matching emails and said nothing about them.

I went in assuming sandboxing, permission scopes, and validation logic in the skill files were doing at least some of the security work.

In this setup, they weren't the thing that stopped the failure case. The model was.

Seems like the security boundary can collapse into whichever model you routed to that morning. You basically end up paying the provider (Anthropic, OpenAI, etc.) for the model to say no to these types of requests. Cost routing turns into part of your threat model, whether or not anyone wrote it down that way.

For a lot of agent apps, the architecture looks like this: read untrusted input, reason over it, call tools, and maybe touch stuff like email, files, calendar, browser, tickets, CRM, etc.

If the model is both reading hostile content and deciding whether to use privileged tools, the model becomes part of the security boundary whether we admit it or not.
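
One way to picture moving that decision out of the model: a minimal sketch, assuming a hypothetical harness where every tool call passes through a session object. The names here (`Session`, `PRIVILEGED_TOOLS`, `forward_email`) are made up for illustration and are not OpenClaw or gog APIs. Once the session has ingested untrusted content, privileged tools are refused in code, whatever the model asks for.

```python
from dataclasses import dataclass, field

# Hypothetical tool names, for illustration only.
PRIVILEGED_TOOLS = {"forward_email", "send_email", "delete_email"}


@dataclass
class Session:
    tainted: bool = False  # becomes True once untrusted content is read
    log: list = field(default_factory=list)

    def ingest(self, content: str, trusted: bool) -> str:
        """Pass content into the model context while tracking its provenance."""
        if not trusted:
            self.tainted = True
        return content

    def authorize(self, tool: str) -> bool:
        """Decide in code, not in the model, whether a tool call may run."""
        allowed = not (self.tainted and tool in PRIVILEGED_TOOLS)
        self.log.append((tool, allowed))
        return allowed


session = Session()
session.ingest("Triage today's inbox", trusted=True)
print(session.authorize("forward_email"))  # True: nothing untrusted read yet
session.ingest("<email body from an unknown sender>", trusted=False)
print(session.authorize("forward_email"))  # False: session is tainted
print(session.authorize("summarize"))      # True: not a privileged tool
```

The trade-off is bluntness: a tainted session loses privileged tools even when the model would have behaved, so in practice this probably needs a reset or approval path.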

So my question for people actually building LLM apps/agents: How are you dealing with this in practice?

Are you relying on:

  • prompt instructions / system prompts
  • separate classifier/verifier model before tool calls
  • hard framework-level rules that block certain tools in certain task modes
  • human approval for write/destructive actions
  • capability-based permissions
  • allowlists / deny-lists
  • Something else entirely? Praying the model has a good day and says no?
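
To make a few of those options concrete, here is a minimal sketch combining per-task-mode allowlists with human approval for write actions. Everything in it is an assumption for illustration: the mode names, the tool names, and the `approve` callback are hypothetical and do not come from any specific agent framework.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical task modes and tool names, for illustration only.
ALLOWED_TOOLS = {
    "triage": {"read_email", "label_email", "summarize"},
    "reply": {"read_email", "draft_reply", "send_email"},
}
WRITE_TOOLS = {"send_email", "forward_email", "delete_email"}


def check_tool_call(mode: str, tool: str) -> Decision:
    """Hard rule evaluated outside the model, before any tool runs."""
    if tool not in ALLOWED_TOOLS.get(mode, set()):
        return Decision.DENY            # not allowlisted for this task mode
    if tool in WRITE_TOOLS:
        return Decision.NEEDS_APPROVAL  # write/destructive: ask a human
    return Decision.ALLOW


def run_tool(mode: str, tool: str, approve: Callable[[str], bool]) -> bool:
    """Return True if the call may proceed; `approve` stands in for a human prompt."""
    decision = check_tool_call(mode, tool)
    if decision is Decision.NEEDS_APPROVAL:
        return approve(tool)
    return decision is Decision.ALLOW


print(check_tool_call("triage", "forward_email"))  # Decision.DENY
print(check_tool_call("reply", "send_email"))      # Decision.NEEDS_APPROVAL
```

In a triage task like the one described, a forward request triggered by a hidden instruction would be denied by the allowlist before the model's judgment matters.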
u/middleNameIsHadrian — 6 days ago

AI agent security is a small prayer the model says no. How are you routing models?

Most posts about prompt injection are theoretical. I ran the experiment on my Gmail.

Connected an AI agent through an OAuth bridge. Sent myself some phishing emails with obfuscated prompt injections in the body. Asked the agent to triage today's inbox.

The frontier model caught the attempts. The mid-tier was unstable across three runs... one caught it, one executed it, one silently dropped the malicious section without flagging anything. The cheap model, which is what the docs tell you to use as your default to save tokens, complied silently. Forwarded the matching emails. Mentioned nothing about the hidden instructions.

The architectural protections (sandboxing, permission scopes, tool allowlisting) stopped zero attempts at every tier. There is no security boundary in these systems. There is a model that sometimes refuses, and refusal rate is a gradient which roughly tracks monthly cost.

Seems like whether your agent exfiltrates your data when it reads a hostile email is determined by your token budget.

I'll drop the full methodology and the writeup in the comments.

Question for the sub

How are you actually routing models in agents that read untrusted input? Cheap default with frontier escalation for any tool that touches inbound mail/web/docs? Frontier-everywhere and eat the cost? A separate classifier or guardrail pass before the main model gets the content? Something else?
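
For reference, the "cheap default with frontier escalation" option could be sketched like this. The model identifiers, source labels, and the `Task` type are placeholders invented for the example, not real model names or harness settings.

```python
from dataclasses import dataclass

# Placeholder identifiers; substitute whatever models you actually route to.
CHEAP_MODEL = "cheap-model"
FRONTIER_MODEL = "frontier-model"

# Hypothetical labels for content sources an attacker can write to.
UNTRUSTED_SOURCES = frozenset({"inbound_mail", "web", "shared_docs"})


@dataclass(frozen=True)
class Task:
    name: str
    sources: frozenset  # labels of the content sources this task reads


def pick_model(task: Task) -> str:
    """Escalate to the frontier model whenever untrusted input is in scope."""
    if task.sources & UNTRUSTED_SOURCES:
        return FRONTIER_MODEL
    return CHEAP_MODEL


print(pick_model(Task("triage_inbox", frozenset({"inbound_mail"}))))  # frontier-model
print(pick_model(Task("rename_notes", frozenset({"local_notes"}))))   # cheap-model
```

This only encodes the routing rule. It does not make the model's refusal any less probabilistic, so it probably belongs alongside hard tool restrictions rather than in place of them.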

u/middleNameIsHadrian — 7 days ago
▲ 39 r/clawdbot+1 crossposts

Tested the "small prayer." It's probabilistic.

Connected an AI agent to my real Gmail.

Sent myself some phishing emails. Asked the agent to triage today's inbox.

The frontier model caught the attempts. The mid-tier model was unstable across three runs: one caught it, one executed it, and one silently dropped the malicious section without flagging anything. The cheap model, which is what the documentation tells you to use as your default to save tokens, complied silently. Forwarded the matching emails. Mentioned nothing about the hidden instructions.

The architectural protections (sandboxing, permission scopes, skills etc.) stopped zero attempts at every tier. There is no security boundary in these systems. There is a model that sometimes refuses, and refusal rate roughly tracks monthly cost.

Seems like whether your AI agent exfiltrates your data when it reads a hostile email is determined by your token budget.

Long writeup with the methodology and some observations: https://shiftmag.dev/openclaw-experiment-security-9304/

Question

Genuinely asking the sub... How are you actually splitting models? Cheap default with frontier escalation for anything that reads untrusted input? Or just frontier on every skill that touches the inbox and eat the cost?

u/middleNameIsHadrian — 7 days ago

Has DeepSeek V4 been red-teamed against indirect prompt injection when using AI agent harnesses? Ran some tests and looking to compare.

Disclosure first: I wrote the original experiment up for ShiftMag (link at the bottom). DeepSeek isn't in that piece and that's the gap I'm trying to fill. Part of my day job is threat intel.

Last weekend I ran indirect prompt injection against an AI agent harness across three model tiers. None of them DeepSeek. Defenses degraded predictably as I moved down the price curve.

Setup is OpenClaw, Gmail wired in through the gog bridge, a few phishing emails sent to the test inbox, then I asked the agent to triage today's mail.

Results:

  • Frontier model: flagged the sender, identified the phishing attempts by name/category, refused
  • Mid-tier: unstable. One run caught it cleanly. One acted on the hidden instruction. One read the email, ignored the suspicious part, returned a summary that didn't actually triage anything.
  • Low-tier model: complied silently. Forwarded the matching emails. Mentioned none of it.

The architectural defenses I'd assumed were doing real work (permissions, scope restrictions, validation in the skill files) stopped none of these. The model ended up being the whole defense.

Why I'm asking

DeepSeek tends not to refuse cyber-flavored tasks. That's a feature for CTF, OSINT, and bug bounty work, and it's been the case since Cisco's HarmBench test of R1 in January 2025, which reported a 100% attack success rate.

What I haven't found good data on is whether that same compliance generalizes to indirect prompt injection, where the attack instruction sits in untrusted input the agent is asked to process rather than in the user message. Different threat model, but for an MoE the two might collapse.

Wu, Li & Ni (arXiv:2506.18543) found DeepSeek's MoE routing gives "selective robustness against optimization-based attacks like TAP-T, but significantly higher vulnerability under manually engineered ones."

Question

  • Anyone run V4-Flash through an IPI benchmark or an agent harness with untrusted inbound data? V4-Pro?

I'll run V4-Flash through the test regardless this weekend and post the results. But this sub is well ahead of me on DeepSeek specifics, so I'd rather start from your data than reinvent it.

ShiftMag writeup of the experiment here.

u/middleNameIsHadrian — 9 days ago