u/AIMadesy

I ran controlled A/B tests on 160 prompt prefix codes over 3 months. Most are placebo. Here's the methodology and what survived.

r/PromptEngineering, posting because I keep seeing "secret prompt codes" threads where people share their favorite prefix (ULTRATHINK, GODMODE, /jailbreak, L99, OODA) with screenshots of one good output and zero baseline comparison. That's not evidence, that's selection bias. So I built a test rig last quarter and ran it for three months. Below is the methodology and the unglamorous findings. Most of this generalizes beyond Claude to any prefix-style instruction code on any frontier model.

The test rig (so you can replicate — minimal code sketch after the list):

  • 6 task categories: factual Q&A, code review, creative writing, summarization, multi-step reasoning, debugging
  • 5 fresh prompts per category, each run 3x to control for sampling noise — 90 outputs per code
  • Same model snapshot (so a behavior shift is the code, not a model update)
  • Blind comparison: 2 reviewers see code-output and baseline-output unlabeled, score on rubric (specificity, commitment, correctness, length-appropriate)
  • Token deltas measured both ways (does the code make output longer? Shorter? Same?)
  • Each code tested against its own no-prefix baseline, not against another code
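
Here's the shape of that rig in code. A minimal sketch, not my exact harness — call_model is a stand-in for whatever API wrapper you use (pinned to one snapshot), and the prompts/<category>/*.txt layout is my assumption for illustration:

```python
import csv
import random
import uuid
from pathlib import Path

RUNS_PER_PROMPT = 3  # 6 categories x 5 prompts x 3 runs = 90 outputs per code

def collect_blind_pairs(code_prefix, call_model, out_csv="pairs.csv", key_csv="key.csv"):
    """Run coded vs. no-prefix baseline on every prompt, shuffle each pair,
    and write unlabeled outputs for the reviewers. The labels go to a
    separate key file so scoring stays blind."""
    with open(out_csv, "w", newline="") as f, open(key_csv, "w", newline="") as k:
        pairs, key = csv.writer(f), csv.writer(k)
        pairs.writerow(["pair_id", "category", "output_a", "output_b"])
        key.writerow(["pair_id", "coded_column"])  # "a" or "b"
        for prompt_file in sorted(Path("prompts").glob("*/*.txt")):
            prompt = prompt_file.read_text()
            for _ in range(RUNS_PER_PROMPT):
                baseline = call_model(prompt)                     # no prefix
                coded = call_model(f"{code_prefix}\n\n{prompt}")  # with prefix
                pair_id = uuid.uuid4().hex[:8]
                coded_col = random.choice(["a", "b"])             # blind the order
                a, b = (coded, baseline) if coded_col == "a" else (baseline, coded)
                pairs.writerow([pair_id, prompt_file.parent.name, a, b])
                key.writerow([pair_id, coded_col])
```

The separate key file is the blinding step: whoever scores pairs.csv can't see which column came from the coded run.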

The most common mistake in informal "prompt code" testing is comparing two codes against each other instead of against the un-prefixed baseline. If both codes produce similar results, it's not because both work — it might be because Claude/GPT/Gemini is doing the work and the prefix is decorative. Always test against no-prefix.

What I found (most of this isn't model-specific):

1. Most prefix codes are placebo or weakly structural.

Of 160 codes I tested, roughly 100 produced no statistically meaningful difference from baseline (my bar for a real effect: both blind reviewers, ≥60% rubric agreement on which output is better, sustained across all 6 task categories). The famous ones (ULTRATHINK, GODMODE, ALPHA, UNCENSORED, sometimes even JAILBREAK variants) are in this bucket. They feel impressive because frontier models are verbose and confident by default. The "before" you're comparing against is worse in your memory than it actually was.
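
For concreteness, the pass/fail check on that bar is tiny — a sketch, assuming each pooled reviewer vote is "code", "baseline", or "tie":

```python
def code_has_real_effect(votes_by_category, threshold=0.60):
    """votes_by_category: {category: ["code" | "baseline" | "tie", ...]},
    pooled across both blind reviewers. A code counts as real only if it
    clears the agreement threshold in every task category."""
    return all(
        votes.count("code") / len(votes) >= threshold
        for votes in votes_by_category.values()
    )
```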

2. About 7 codes consistently shift reasoning, not just format.

There's a difference between codes that change how the model thinks and codes that change how the output looks. The latter is far more common. The reasoning-shifters I found:

  • A hedge-killer ("commit to one answer, name the second-best, explain why you ruled it out") — wins on decision questions, loses on factual lookups
  • A premise-challenger ("before answering, question whether this is the right question") — wins on strategy, loses on time-pressured operational questions
  • A blind-spot surfacer ("list what the asker probably hasn't considered") — wins on debugging and code review
  • A fuzzy-task decomposer ("break this into testable subtasks with leverage ranking") — wins on planning
  • A time-pressured decision framework — wins on incidents, loses on open-ended strategy
  • Two synthesis structures for multi-output tasks (interview synthesis, PRD scoping)

The pattern: the codes that win are the ones that force a specific reasoning mode you didn't ask for. The codes that lose are the ones that just decorate output.

3. Stacking >2 codes degrades output across frontier models.

I tested 2-, 3-, and 4-code stacks. Past 2, all three models I tested (Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro) start partially honoring one code and ignoring the others. The stack becomes a coin flip. The L99 + /skeptic pair is the only stack I trust daily; everything else I run solo.

4. Codes rot. Quarterly re-testing is mandatory.

Model updates shift behavior in non-obvious ways. Codes that crushed 6 months ago are quietly underperforming now. ARTIFACTS used to force structured multi-part output; today's models do that by default, so ARTIFACTS adds nothing. Conversely, the hedge-killer code is sharper now than 6 months ago, probably because the models lean harder into hedging by default in newer RLHF passes. If you read a "best prompt codes" listicle that wasn't re-tested in the last quarter, treat it as historical.

5. Universal failure mode: confirmation bias on the "after" output.

When someone posts a "look how much better this prompt is" screenshot, they ran the code 3+ times until they got an output they liked, then compared it to one baseline run. The baseline-to-code shift in any individual sample is dominated by the model's stochastic variance, not by the prefix. Run the code 5 times, run the baseline 5 times, then judge. The signal often disappears.
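
The fix is mechanical. A sketch of that 5-vs-5 check — score() is a stand-in for whatever rubric scorer you use:

```python
import statistics

def five_vs_five(prompt, code_prefix, call_model, score, n=5):
    """Judge distributions, not one cherry-picked pair."""
    base = [score(call_model(prompt)) for _ in range(n)]
    coded = [score(call_model(f"{code_prefix}\n\n{prompt}")) for _ in range(n)]
    lift = statistics.mean(coded) - statistics.mean(base)
    spread = statistics.stdev(base + coded)
    # If lift is small relative to run-to-run spread, you were looking
    # at sampling noise, not the prefix.
    return lift, spread
```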

Practical implication:

Most prompt engineering effort is better spent on:

  • Better context (system prompts, example shots, retrieval) than on prefix tricks
  • Better task decomposition than on bigger single-shot prompts
  • Better evaluation harnesses so you actually know if your prompt is better

Prefix codes are a micro-optimization at best. They matter for the ~7 reasoning-shifters above; they're noise for the rest.

What I built: I run clskillshub.com, a Claude-focused reference site I built using Claude Code. The full breakdown of every tested code (Claude-specific) lives there with before/after outputs, classification (reasoning-shifter vs structural vs placebo), and the test methodology. There's a free 100-code library at clskillshub.com/prompts and a free 40-page Claude guide at clskillshub.com/guide. Paid tiers exist for the deep classification work but you don't need them to start.

Happy to answer methodology questions — especially if you've built your own prompt-eval harness and want to compare notes.

u/AIMadesy — 3 days ago

Spent two days last week converting a 10-year-old SELECT FOR ALL ENTRIES report to a CDS view. The original report ran 4 minutes against a HANA DB, and the team had been "going to refactor it" for about four years. Decided to just do it.

Tried Claude with a generic "convert this ABAP to a CDS view" prompt. Got back hallucinated annotations that don't exist (it invented an @AccessControl variant that's not real), used FOR ALL ENTRIES inside the view definition itself, which isn't a thing in CDS, and confidently told me to use a function module that's been deprecated since 7.4. Confident, totally wrong.

Rewrote the prompt with the actual context: the source tables, the existing report's behavior, the join semantics it had to preserve, the performance target, the SAP release. Final runtime: 22 seconds. Code is half the size. Same model, different prompt.

The pattern that worked: tell the LLM exactly what SAP version, what release, what data shape, what authority objects are in scope, and what's already been ruled out. Generic prompts give generic answers, which read fine until you try to compile.

Same thing applies to RAP migrations, where I've seen even worse hallucinations. It happily invents behavior definition syntax. Same fix, more context up front.

Here's the template I landed on. Bracketed parts get filled in with your specifics:

---

Convert this legacy ABAP report to use a CDS view instead of SELECT...FOR ALL ENTRIES.

Current code: [paste the SELECT block + relevant prep code]

Tables and joins involved: [list]

SAP release: [e.g. S/4HANA 2023, ECC 6.0 EHP8]

Performance target: [current runtime → desired runtime]

Authority objects to honor: [list]

Produce:

  1. The CDS view definition (DEFINE VIEW or DEFINE VIEW ENTITY) with associations

  2. The new ABAP code that consumes the CDS view

  3. The places in the legacy code that can now be deleted

  4. The unit test that proves equivalence (same input produces same output)

  5. The performance impact estimate with reasoning

Use modern ABAP syntax (7.5+). No FORM routines. Flag anything that genuinely doesn't translate cleanly.

---

The "flag anything that doesn't translate" line is the one that earns its keep. Without it Claude papers over genuinely unmappable patterns and you find out at compile time.

Curious if other ABAPers here have noticed the same gap, especially around CDS projection views for OData. The hallucination rate on those feels even higher than basic CDS.

u/AIMadesy — 10 days ago

I run a small one-person research lab where i test claude prompt codes and skill files. ran a stupid amount of tests for a guide i was writing. three things genuinely changed my workflow.

  1. where you put a scope sentence matters more than how you word it

i used to spend ages tweaking the wording of "review only the database logic" type sentences. then i ran the same wording at the start of a prompt vs at the end and the difference was huge. token output stayed about 30% tighter when scope went first. felt dumb for not testing it sooner.

  2. about half the famous claude prompt codes do nothing

tested around 120 popular codes against a fixed task battery and 47% had no measurable effect over a plain prompt. "take a deep breath" was real on older claude, doesn't reproduce on sonnet 4.6 or opus 4.7. "you are a stanford-trained expert" actually flips negative on reasoning tasks. most "step by step" variants are the default behavior already. you're typing extra characters for no reason.

  3. skill file descriptions are everything

if a skill .md file in ~/.claude/skills/ isn't auto-activating, the description field is almost always why. "helps with database stuff" never triggers. "use when configuring database connection pooling, choosing pool sizes, or debugging connection exhaustion" triggers reliably. vague descriptions match weakly. specific ones win.
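
for reference, the shape i mean — hedged sketch, check your claude code version's docs for the exact frontmatter fields, but this is the pattern that triggered reliably for me:

```
---
name: db-connection-pooling
description: use when configuring database connection pooling, choosing pool sizes, or debugging connection exhaustion (timeouts, "too many connections" errors)
---

# connection pooling

...actual skill content...
```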

full guide is free at clskillshub.com/guide if you want the other 9 lessons. 40 pages, instant download, no email signup.

u/AIMadesy — 3 days ago

I have been using Claude Code as my primary AI coding tool for 6 months. The most useful exercise I did was sit down and write out the 8 workflows that actually moved my productivity, plus what each one replaced from my pre-Claude routine. Sharing in case it is useful to anyone trying to figure out where Claude Code earns its keep vs where it is just adding noise.

The pattern that emerged: Claude Code wins where the work is "spec a behavior, get the implementation". It loses where the work is "the spec is in your head and you do not know it yet." So my 8 workflows are all in the first bucket.

1. Test generation
Prompt: paste the function, describe the behavior you want covered, ask for tests in your project's framework.
What it replaced: hand-writing tests. My test count is 3x what it was at the same effort cost.
Where it fails: tests for stateful systems where the test setup itself is the hard part. Claude writes the assertions fine, gets the setup wrong half the time.

2. Code review of my own diffs
Prompt: paste the git diff, ask "review this for bugs, edge cases, missed null checks, and stylistic issues against the existing codebase".
What it replaced: my own re-read of my own code, which I always thought was thorough but never was.
Where it fails: anything requiring product context. Claude does not know your customer.

3. Refactoring legacy code
Prompt: paste the legacy function, describe the new shape you want, ask for a refactor that preserves behavior.
What it replaced: 30-90 minute refactor sessions. Now they are 10 minutes plus a careful review.
Critical: always run tests after. Claude will quietly drop edge-case handling if the original was implicit.

4. Migration scripts
Database migrations, config file rewrites, codemod scripts. Claude is unusually good at one-shot migrations because the input and output schemas are explicit.

5. Debugging unfamiliar codebases
Prompt: paste the stack trace, paste the relevant 2-3 files, describe what you tried.
What it replaced: an hour of grep + git blame archaeology.

6. Tech spec drafting
Describe the feature, ask Claude to write a 1-page spec with goals, non-goals, design questions, and rollout plan. Edit heavily.
What it replaced: the blank-page problem on every new project.

7. CI failure summarization
Paste a 4000-line CI log, ask "what is the actual root cause and which 5 lines do I need to look at."
What it replaced: scrolling.

8. Generating mocks and fixtures
Especially API response fixtures for tests. Describe the shape, paste a real example, ask for 5 variations covering edge cases (example of the output shape after this list).
What it replaced: maintaining a fixtures/ directory by hand.

The 8 are listed in order of how often I use them daily. 1, 2, 5 are every day. 3, 4, 7, 8 are weekly. 6 is monthly.
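
To make workflow 8 concrete, here's the output shape I ask for — names and values are made up for illustration:

```python
# One real API response in, five edge-case variants out.
BASE = {"id": 42, "email": "a@example.com", "plan": "pro", "seats": 5}

FIXTURES = [
    BASE,
    {**BASE, "seats": 0},                             # zero-quantity edge
    {**BASE, "email": None},                          # missing optional field
    {**BASE, "plan": "enterprise", "seats": 10_000},  # large values
    {**BASE, "id": -1},                               # invalid id the client must reject
    {**BASE, "plan": "unknown-tier"},                 # enum value the client has never seen
]
```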

Three workflows I tried that did NOT make the cut after a fair test:

  • Pair programming for new features. Claude is a good autocomplete, a bad pair. The "pair" framing makes me lazy about thinking. Better to spec the behavior, then check Claude's output.
  • Code review of someone else's PR. Claude flags style issues but cannot tell you whether the change is the right shape for the team's roadmap. That is the actual hard part of review.
  • Open-ended exploration. "Help me figure out what to build" never ends well. Claude generates plausible options, none of them yours. Better to bring a hypothesis and ask Claude to poke holes in it.

Curious what made other people's lists. Specifically interested in workflows where you tried Claude and abandoned it because the human version was actually faster.

If anyone wants the exact prompt templates I use for these 8 plus the next 4 I am still testing, I wrote them up in a longer book at clskillshub.com/career-playbook. Optional, the post above stands on its own.

u/AIMadesy — 12 days ago

Got tired of arguing with my team about which "Claude pro tip" tweets were real and which were vibes, so I built a rig and ran them.

Setup:

  • 24 fixed tasks across writing, coding, analysis
  • Fresh contexts per trial, no carryover
  • Same model (Sonnet 4.6 and Opus 4.7), same temperature
  • 3 blind reviewers rating outputs on decisiveness, accuracy, token efficiency

Headline finding: 47% of the codes showed no statistically significant lift over a plain prompt. Some of the most-upvoted ones on Twitter and this sub were dead weight.

Three patterns that consistently won (out of the 53% that worked):

  1. Front-loaded scope anchors. Putting "Review only the database connection logic in src/db/" at the START of a prompt held scope better than the same wording at the end. Token output ~30% tighter on review tasks.
  2. Explicit OUT OF SCOPE rejection clauses. Telling the model "if a finding is outside the scope above, mark it OUT OF SCOPE rather than including it" cut cross-file noise measurably. Works as the model's escape valve, not a positive constraint. (Combined example of patterns 1 and 2 after the list.)
  3. The L99 prefix. Switches Claude into a less-hedged, more decisive mode. Best for hard architectural decisions, terrible for simple lookups (waste of tokens).
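
The combined header I now paste at the top of review prompts, merging patterns 1 and 2 — illustrative wording, not the exact battery text:

---

Review only the database connection logic in src/db/.

If a finding is outside the scope above, mark it OUT OF SCOPE rather than including it in the main review.

[paste the diff or files here]

---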

Three that turned out to be placebos:

  1. "Take a deep breath." Real finding on older models, doesn't replicate on Sonnet 4.6 or Opus 4.7. Tested both with and without it on the same task battery, no measurable delta.
  2. "You are a Stanford-trained expert." Slight lift on pure factual recall, flips negative on reasoning tasks. The model gets defensive about its expertise instead of admitting uncertainty.
  3. Most "step by step" variants. Already the default behavior on current Claude. Adding the phrase didn't change structure or quality in our tests.

Happy to walk through methodology, specific test cases, or the codes that flipped between models if anyone wants to dig in.

u/AIMadesy — 13 days ago

Quick context. I'm 25, AI Systems Engineer in Pune, on weekends I run a small Claude Code skills marketplace. The weekend project means I end up on a lot of calls with eng managers. Not formal interviews, just conversations. Last 4 months it has been around 30 of them across Pune, Bangalore, SF, London, Berlin.

Patterns I keep hearing. Some of these surprised me. Posting to see if your local market matches.

1. The "do you use AI tools" question is dead.
Nobody asks this anymore. It is assumed. A friend who runs eng at a 200-person fintech in Bangalore told me he stopped asking in February. Replaced it with "show me your CLAUDE.md" or "walk me through how you would refactor this 1400-line file with Claude Code." The answer to the first question is now "yes, obviously." The signal moved one layer up.

2. Having an OPINION on Cursor vs Claude Code vs Cline is now a screening signal.
I have heard variations of "I will hire someone who has tried 3 tools and can defend why they picked one over someone who has only tried Cursor" from 4 different hiring managers in the last month. The opinion does not have to be the "right" one. It has to be informed and defensible. People who say "I just use Cursor because everyone uses it" do worse in interviews now.

3. Leetcode is not dead, but its weight has dropped.
Several hiring managers said the same thing in different words. "If a candidate solves a medium in 20 minutes I am happy. If they fumble it but then say 'let me think out loud about how I would actually solve this with my agent in real life,' I am more interested." System design and AI-fluency interviews are getting longer. Algo interviews are getting shorter. This is not universal. Some FAANG-adjacent companies are still 100% algo. But the distribution is shifting.

4. The personal CLAUDE.md or .cursorrules file is the new dotfiles.
I cannot tell you how many people have asked me about my personal config. It has become a soft proxy for "this person has actually engaged with the tools, not just installed them." If you have a real one with real conventions for your work, mention it on your resume or in your portfolio. If you do not have one, write one this weekend. It takes 90 minutes and it pays off in interviews for years.
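
If the blank page is the blocker, a minimal sketch of the shape — the conventions below are invented examples, yours should encode your actual stack:

```
# CLAUDE.md

- Run the test suite with `make test` before declaring a fix done.
- Prefer small, reviewable diffs. Never reformat files the change does not touch.
- All new code is TypeScript strict mode. No `any`, no `@ts-ignore`.
- Migrations: never edit an applied migration, always add a new one.
```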

5. Side projects matter again, but only AI-shipped ones.
The "I built a Twitter clone in Rails" portfolio is not a positive signal in 2026. It used to be. Now it reads as "this person is on a 2018 stack and stopped learning." The replacement signal is "I built X with Claude Code in 4 weekends, here is what I learned about agentic workflows, here is the deployed URL with 200 real users." A janky-but-deployed AI-native project beats a polished half-finished todo app.

6. Domain plus AI is the highest-paid combo right now.
The fastest-growing engineering salary band I have heard about is people who combine deep vertical expertise (fintech compliance, healthcare data, legal documents) with AI workflow chops. Generic "AI engineer" pays well. "Healthcare engineer who knows how to ship LLM-powered features under HIPAA" pays a lot more. If you have a domain, lean into it. Do not try to become a generic AI engineer.

7. The premium for AI-fluent devs is between 12 percent and 25 percent right now.
This is the number I hear most often when I push. Levels.fyi data roughly backs it up. The premium grows at higher seniority. A staff engineer who is recognizably AI-native earns meaningfully more than a staff engineer who is not. At the junior level the premium is small. At the staff level it is decisive.

8. The "single-tool dependency" is the failure mode hiring managers are starting to filter against.
Someone who only knows Cursor and falls apart when Cursor has an outage is now a flag. Multiple managers told me they explicitly ask "what would you do if your primary AI tool was down for a day?" The right answer is "I have Cline and Aider as backups and the prompts in my head are tool-agnostic." The wrong answer is silence.

u/AIMadesy — 16 days ago