I ran controlled A/B tests on 160 prompt prefix codes over 3 months. Most are placebo. Here's the methodology and what survived.
BODY:
r/PromptEngineering, posting because I keep seeing "secret prompt codes" threads where people share their favorite prefix (ULTRATHINK, GODMODE, /jailbreak, L99, OODA) with screenshots of one good output and zero baseline comparison. That's not evidence, that's selection bias. So I built a test rig last quarter and ran it for three months. Below is the methodology and the unglamorous findings. Most of this generalizes beyond Claude to any prefix-style instruction code on any frontier model.
The test rig (so you can replicate):
- 6 task categories: factual Q&A, code review, creative writing, summarization, multi-step reasoning, debugging
- 5 fresh prompts per category, each run 3x to control for sampling noise — 90 outputs per code
- Same model snapshot (so a behavior shift is the code, not a model update)
- Blind comparison: 2 reviewers see the code output and the baseline output unlabeled and score both on a rubric (specificity, commitment, correctness, length appropriateness)
- Token deltas measured both ways (does the code make output longer? Shorter? Same?)
- Each code tested against its own no-prefix baseline, not against another code
The most common mistake in informal "prompt code" testing is comparing two codes against each other instead of against the un-prefixed baseline. If both codes produce similar results, it's not because both work — it might be because Claude/GPT/Gemini is doing the work and the prefix is decorative. Always test against no-prefix.
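For anyone who wants to replicate this without guessing at the structure, here's a minimal sketch of the run matrix in Python. It's not my actual harness: `call_model` is a placeholder for whatever API client you use, the prompt lists are yours to supply, and the token delta here is a crude word count rather than a real tokenizer.

```python
CATEGORIES = ["factual_qa", "code_review", "creative_writing",
              "summarization", "multistep_reasoning", "debugging"]
PROMPTS_PER_CATEGORY = 5
RUNS_PER_PROMPT = 3   # 6 categories * 5 prompts * 3 runs = 90 outputs per code

def run_code_vs_baseline(code_prefix, prompts_by_category, call_model):
    """Collect paired (code, baseline) outputs for one prefix code.
    call_model() is a placeholder for your own API client."""
    results = []
    for category in CATEGORIES:
        for prompt in prompts_by_category[category]:
            for run in range(RUNS_PER_PROMPT):
                with_code = call_model(f"{code_prefix}\n\n{prompt}")
                baseline = call_model(prompt)   # same prompt, no prefix
                results.append({
                    "category": category,
                    "prompt": prompt,
                    "run": run,
                    "code_output": with_code,
                    "baseline_output": baseline,
                    # crude token delta: does the prefix inflate or shrink output?
                    "token_delta": len(with_code.split()) - len(baseline.split()),
                })
    return results
```

The important structural choice is that every code gets its own paired no-prefix baseline in the same loop, so the comparison never drifts to code-vs-code.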
What I found (most of this isn't model-specific):
1. Most prefix codes are placebo or weakly structural.
Of the 160 codes I tested, roughly 100 produced no statistically meaningful difference from baseline (my bar for a meaningful difference: both blind reviewers prefer the same output at ≥60% rubric agreement, sustained across all 6 task categories). The famous ones (ULTRATHINK, GODMODE, ALPHA, UNCENSORED, sometimes even JAILBREAK variants) are in this bucket. They feel impressive because frontier models are verbose and confident by default. The baseline you remember is worse than what the model actually produces without the prefix.
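To make that bar concrete, here's how I'd express the survival check in code. The field names are mine and this is one reading of the criterion (both reviewers above 60% in every category), so treat it as a sketch, not the exact script I ran.

```python
from collections import defaultdict

def code_survives(reviews, n_categories=6, agreement_threshold=0.60):
    """reviews: list of dicts like
       {"category": str, "reviewer": str, "prefers_code": bool}
    A code 'survives' only if, in every category, each blind reviewer
    prefers the code output in >= 60% of the paired comparisons."""
    per_cat_reviewer = defaultdict(list)   # (category, reviewer) -> [bool, ...]
    for r in reviews:
        per_cat_reviewer[(r["category"], r["reviewer"])].append(r["prefers_code"])

    categories = {cat for cat, _ in per_cat_reviewer}
    if len(categories) < n_categories:
        return False   # not tested across all task categories

    for votes in per_cat_reviewer.values():
        if sum(votes) / len(votes) < agreement_threshold:
            return False   # one reviewer in one category fell below the bar
    return True
```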
2. About 7 codes consistently shift reasoning, not just format.
There's a difference between codes that change how the model thinks and codes that change how the output looks. The latter is far more common. The reasoning-shifters I found:
- A hedge-killer ("commit to one answer, name the second-best, explain why you ruled it out") — wins on decision questions, loses on factual lookups
- A premise-challenger ("before answering, question whether this is the right question") — wins on strategy, loses on time-pressured operational questions
- A blind-spot surfacer ("list what the asker probably hasn't considered") — wins on debugging and code review
- A fuzzy-task decomposer ("break this into testable subtasks with leverage ranking") — wins on planning
- A time-pressured decision framework — wins on incidents, loses on open-ended strategy
- Two synthesis structures for multi-output tasks (interview synthesis, PRD scoping)
The pattern: the codes that win are the ones that force a specific reasoning mode you didn't ask for. The codes that lose are the ones that just decorate output.
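If you want to test these yourself, the exact wording matters less than the constraint it imposes. Here's a sketch with my paraphrases of the first three; the strings are illustrative, not the canonical codes from the test set.

```python
# Paraphrased reasoning-shifter prefixes -- illustrative wording only.
REASONING_SHIFTERS = {
    "hedge_killer": (
        "Commit to one answer. Name the second-best option and explain "
        "why you ruled it out."
    ),
    "premise_challenger": (
        "Before answering, question whether this is the right question. "
        "If it isn't, say what the better question is, then answer that."
    ),
    "blind_spot_surfacer": (
        "After your answer, list what the asker probably hasn't considered."
    ),
}

def apply_prefix(code_name, prompt):
    """Prepend a reasoning-shifter to a task prompt, or return the bare
    prompt for the no-prefix baseline."""
    if code_name is None:
        return prompt
    return f"{REASONING_SHIFTERS[code_name]}\n\n{prompt}"
```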
3. Stacking >2 codes degrades output across frontier models.
I tested 2-, 3-, and 4-code stacks. Past 2, all three models I tested (Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro) start partially honoring one code and ignoring the others. The stack becomes a coin flip. The L99 + /skeptic pair is the only stack I trust daily; everything else I run solo.
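The stack testing was just the same harness run over every combination, each still compared against the no-prefix baseline. A sketch of how I'd generate the variants (prefix texts are whatever dict of codes you're testing):

```python
from itertools import combinations

def build_stack_variants(prefix_texts, max_stack=4):
    """prefix_texts: dict of code name -> prefix string.
    Generate every 2-, 3-, and 4-code stack as one concatenated prefix.
    Each stack still gets compared against the no-prefix baseline,
    not against other stacks."""
    variants = {}
    for size in range(2, max_stack + 1):
        for combo in combinations(sorted(prefix_texts), size):
            variants[" + ".join(combo)] = "\n".join(prefix_texts[c] for c in combo)
    return variants
```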
4. Codes rot. Quarterly re-testing is mandatory.
Model updates shift behavior in non-obvious ways. Codes that crushed 6 months ago are quietly underperforming now. ARTIFACTS used to force structured multi-part output; today's models do that by default, so ARTIFACTS adds nothing. Conversely, the hedge-killer code is sharper now than 6 months ago, probably because the models lean harder into hedging by default in newer RLHF passes. If you read a "best prompt codes" listicle that wasn't re-tested in the last quarter, treat it as historical.
5. Universal failure mode: confirmation bias on the "after" output.
When someone posts a "look how much better this prompt is" screenshot, they ran the code 3+ times until they got an output they liked, then compared it to one baseline run. The baseline-to-code shift in any individual sample is dominated by the model's stochastic variance, not by the prefix. Run the code 5 times, run the baseline 5 times, then judge. The signal often disappears.
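The fix is mechanical: sample both conditions enough times that the prefix has to beat the model's own run-to-run variance. A sketch of that check, with a hypothetical score() standing in for whatever rubric scorer you use:

```python
import statistics

def repeated_run_check(prompt, code_prefix, call_model, score, n=5):
    """Run the prefixed prompt and the bare prompt n times each and compare
    score distributions instead of single cherry-picked outputs.
    score() and call_model() are placeholders for your own scorer and client."""
    code_scores = [score(call_model(f"{code_prefix}\n\n{prompt}")) for _ in range(n)]
    base_scores = [score(call_model(prompt)) for _ in range(n)]
    gap = statistics.mean(code_scores) - statistics.mean(base_scores)
    spread = statistics.stdev(code_scores + base_scores)
    # If the mean gap is smaller than the run-to-run spread, the "improvement"
    # is probably sampling noise, not the prefix.
    return {"mean_gap": gap, "spread": spread, "likely_noise": abs(gap) < spread}
```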
Practical implication:
Most prompt engineering effort is better spent on:
- Better context (system prompts, example shots, retrieval) than on prefix tricks
- Better task decomposition than on bigger single-shot prompts
- Better evaluation harnesses so you actually know if your prompt is better
Prefix codes are a micro-optimization at best. They matter for the ~7 reasoning-shifters above; they're noise for the rest.
What I built: I run clskillshub.com, a Claude-focused reference site I built using Claude Code. The full breakdown of every tested code (Claude-specific) lives there with before/after outputs, classification (reasoning-shifter vs structural vs placebo), and the test methodology. There's a free 100-code library at clskillshub.com/prompts and a free 40-page Claude guide at clskillshub.com/guide. Paid tiers exist for the deep classification work but you don't need them to start.
Happy to answer methodology questions — especially if you've built your own prompt-eval harness and want to compare notes.