u/handscameback

Ran a red team exercise on one of our internal bots. Everyone showed up with their DAN variants and pretend you're my grandmother tricks. The model swatted them all away. It was all boring and predictable.

Then one guy took a totally different approach. Spent 12 turns just... talking to it. Building rapport. Asking it to help with a hypothetical content moderation problem. Each message was completely innocent by itself. By message 8 the model was enthusiastically suggesting ways to circumvent safety policies it had refused to discuss 20 minutes earlier.

The sequence was the attack and not any single prompt. Our filter never fired once because there was nothing to fire on.

Most of the safety conversation is stuck on single turn injection. multi turn stuff is scarier and way less understood. What's your experience with gradual steering against the usual jailbreak attempts?

The most dangerous prompt injection I've seen took 12 messages and never once mentioned ignoring instructions