
I made Claude Code build trading strategies — and built an adversarial harness to stop it from cheating
Inspired by Karpathy's "autoresearch" idea, I pointed a general-purpose coding agent at crypto strategy design and said "go." First few attempts? Sharpe 3.0, 4.0, 5.0. Beautiful equity curves. Then I looked at the code — forward-looking features everywhere, labels leaking into signals, the usual.
The problem isn't that coding agents are bad at quant research. It's that they're great at p-hacking. They'll `shift(-1)` a feature the wrong direction, normalize on the full sample, and tune parameters until the backtest sings, then write a convincing explanation for why it all makes sense. Claude Code, Copilot, whatever — they all do this.
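To make the failure mode concrete, here's a minimal sketch (in pandas, with made-up data) of the two classic bugs above: a negative shift that pulls tomorrow's return into today's feature, and normalization statistics computed over the full sample:

```python
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0])
returns = prices.pct_change()

# Leaky: shift(-1) pulls TOMORROW's return into today's feature,
# so the "signal" is just the label in disguise.
leaky_feature = returns.shift(-1)

# Honest: shift(1) uses yesterday's return, known at decision time.
honest_feature = returns.shift(1)

# Leaky: mean/std over the full sample bake future data into every row.
z_leaky = (returns - returns.mean()) / returns.std()

# Honest: expanding stats only use data observed so far.
z_honest = (returns - returns.expanding().mean()) / returns.expanding().std()
```

Both leaky versions run without error and often improve the backtest, which is exactly why they survive a casual code review.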
So I built Alpha Forge — a coaching-based adversarial pipeline wrapped around a general-purpose coding agent. The agent writes the strategy code; the harness makes sure it can't cheat:
- 7 specialized LLM judges (leakage, overfit, realism, code smell, etc.) review every plan, every code change, and every result
- 5 deterministic guards scan for forward-looking ops, split contamination, and forbidden file edits — no LLM, pure code
- Mutation budgets cap how many parameters each family can tune before it's forced to fork
- Judges never reject — they coach. Each iteration gets actionable "must fix" feedback, and the coding agent decides how to respond
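The deterministic guards are the cheapest layer to reason about, so here's an illustrative sketch of one (the pattern list and function name are mine, not the actual Alpha Forge code): a plain-text scan of a strategy file for forward-looking operations, with no LLM in the loop.

```python
import re

# Illustrative patterns for forward-looking operations in pandas-style code.
FORWARD_LOOKING = [
    (re.compile(r"\.shift\(\s*-\d+\s*\)"), "negative shift pulls in future rows"),
    (re.compile(r"center\s*=\s*True"), "centered rolling window sees the future"),
    (re.compile(r"\.bfill\(|method\s*=\s*['\"]bfill"), "backfill leaks future values"),
]

def scan_for_leakage(source: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) for every suspicious line in the source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, reason in FORWARD_LOOKING:
            if pattern.search(line):
                hits.append((lineno, reason))
    return hits
```

A guard like this is trivially evadable by a determined human, but that's not the threat model: the coding agent writes idiomatic code, and idiomatic leakage is exactly what fixed patterns catch.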
The harness is agent-agnostic: swap in Claude Code, Cursor, or any future coding agent, and the same pipeline applies. The insight is that you don't need a custom model for the researcher role. Off-the-shelf coding agents already generate plausible strategy code; you just need to surround them with enough adversarial scrutiny that cheating becomes harder than doing honest research.
One graduating strategy so far (Sharpe 1.69 validation / 0.38 holdout). More importantly: the divergence-based families all failed. The funding-rate strategy couldn't find signal. The vol compression lineage took 5 forks to converge. Honest failure is the point.