u/Great_Donut1459

https://preview.redd.it/xwsaa7ux54yg1.png?width=1174&format=png&auto=webp&s=56b44e99c8cbafff70c07f2490b5259b2d23e9a5

Spent two weeks building a cross-sectional factor research platform from scratch and used 7 different AI agent configurations to do the actual development. Not a toy benchmark — this covered the full pipeline: data ingestion for both US equities (S&P 500, ADV top 1000/2000) and China A-shares (CSI1000), factor computation, IC analysis, walk-forward backtesting with regime filtering, and live simulated portfolio deployment.
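For anyone unfamiliar with the IC-analysis step in that pipeline, here's a minimal sketch of how daily cross-sectional ICs are typically computed: a Spearman rank correlation between factor values and forward returns, taken across stocks on each date. The data layout and function name are my own for illustration, not the platform's actual code:

```python
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Cross-sectional Spearman IC per date.

    factor, fwd_ret: rows = dates, columns = tickers (a hypothetical
    layout; the post doesn't specify its own data shapes).
    """
    ics = {}
    for dt in factor.index:
        f, r = factor.loc[dt], fwd_ret.loc[dt]
        mask = f.notna() & r.notna()
        if mask.sum() < 3:  # skip dates with too few names
            continue
        # Spearman = Pearson correlation on ranks
        ics[dt] = f[mask].rank().corr(r[mask].rank())
    return pd.Series(ics)

# toy check: a factor that perfectly ranks forward returns gives IC = 1.0
dates = pd.date_range("2024-01-02", periods=2)
fac = pd.DataFrame([[1, 2, 3, 4]] * 2, index=dates, columns=list("ABCD"))
ret = pd.DataFrame([[0.01, 0.02, 0.03, 0.04]] * 2, index=dates, columns=list("ABCD"))
print(daily_ic(fac, ret).mean())  # → 1.0
```

The mean of that series is the headline IC number quoted later in the post; in practice you'd also look at its t-stat and sign stability.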

The lineup:

  • Codex (GPT-series, native environment)
  • Claude Code + Sonnet 4.6
  • Claude Code + Opus 4.6
  • Claude Code + Opus 4.6 Thinking + Sonnet 4.6 (hybrid)
  • GLM5 + Hermes
  • GLM5 + Claude Code
  • Qwen 3.6 Plus + Hermes

Scored on 4 dimensions: task competence (60%), speed/efficiency (15%), cost (15%), ease of setup (10%).
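The composite is just a weighted sum. The weights below are from the post; the per-dimension sub-scores are invented purely to show the arithmetic, they are not the actual scores behind any ranking:

```python
# Weights from the post; sub-scores in the example call are made up.
WEIGHTS = {"competence": 0.60, "speed": 0.15, "cost": 0.15, "setup": 0.10}

def composite(scores: dict) -> float:
    assert set(scores) == set(WEIGHTS), "must score all four dimensions"
    return round(sum(WEIGHTS[k] * v for k, v in scores.items()), 2)

print(composite({"competence": 9.0, "speed": 6.0, "cost": 5.0, "setup": 8.0}))
# → 7.85
```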

TL;DR results:

  1. Opus 4.6 (7.75/10) — the only one that genuinely understood the project architecture. When I said "build a cross-sectional factor platform," it asked about universe definitions, point-in-time alignment for fundamentals, and time-series vs cross-sectional neutralization before writing a single line of code. Nothing else came close on architectural reasoning. But it's expensive — you'll need the $200/month plan to use it seriously.
  2. Codex (7.45/10) — the reliable workhorse. Not the smartest but the most stable: consistent code quality, minimal bugs, handles large volumes well. Think of it as the senior engineer who won't surprise you but won't break things either. The Plus quota was recently reduced, though, so I rotate between two accounts.
  3. Opus Think+Sonnet (6.80/10) — best cost/performance ratio. Opus plans the architecture, Sonnet executes. You can switch between them with /model in Claude Code. Recommended default for most people.
  4. GLM5+Claude Code (6.85/10) — surprise performer. GLM5's engineering ability is genuinely close to Sonnet 4.6. Paired with Claude Code's framework it handles non-critical modules well. Good budget option.
  5. Sonnet 4.6 solo (6.10/10) — capable but needs more hand-holding. Doesn't decompose tasks well on its own.

  6-7. GLM5 + Hermes (5.90) and Qwen 3.6 + Hermes (4.35): both struggled hard. The gap was most visible in financial domain knowledge: neither understood why cross-sectional neutralization matters, neither could handle PIT data alignment, and neither proactively addressed survivorship bias. Lighter agent frameworks like Hermes don't catch these corner cases.
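Since cross-sectional neutralization was the clearest knowledge gap, here's what its simplest form looks like: z-scoring each date's cross-section so the factor ranks stocks against same-day peers rather than against its own history. This is a generic sketch, not the platform's implementation (which might also regress out sector and size):

```python
import numpy as np
import pandas as pd

def xs_neutralize(factor: pd.DataFrame) -> pd.DataFrame:
    """Z-score each date's cross-section.

    Rows = dates, columns = tickers (hypothetical layout). After this,
    every row has mean 0 and unit std, so market-wide level shifts
    stop masquerading as stock-selection signal.
    """
    mu = factor.mean(axis=1)
    sd = factor.std(axis=1)
    return factor.sub(mu, axis=0).div(sd, axis=0)

raw = pd.DataFrame({"A": [10.0, 20.0], "B": [12.0, 22.0], "C": [14.0, 24.0]},
                   index=pd.to_datetime(["2024-01-02", "2024-01-03"]))
z = xs_neutralize(raw)
print(z.round(2))
```

Note that day two is just day one shifted up by 10 for every stock; after neutralization both rows are identical, which is exactly the point.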

Some context on the actual quant project: Ended up with a 13-factor model (mix of price/volume and fundamental factors), running walk-forward on CSI1000 with MA200 regime filter. OOS Sharpe around 0.49 at 20bps cost, MaxDD around -27%. Not amazing but it's a functioning baseline with a live sim portfolio running. The US side was more humbling — ran Alpha158 + Alpha101 + Open Asset Pricing fundamentals across S&P 500 and ADV top 1000/2000, couldn't find any single factor with IC > 0.02. Institutional coverage is just too dense for simple cross-sectional alpha on US large/mid caps.
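For readers curious about the MA200 regime filter: a common implementation is to go risk-on only when the index closes above its 200-day simple moving average. The one-bar lag and the gating convention below are my assumptions; the post doesn't spell out those details:

```python
import numpy as np
import pandas as pd

def regime_on(index_close: pd.Series, window: int = 200) -> pd.Series:
    """True when the index closed above its simple moving average.

    Lagged one bar so the signal is known before it is traded on --
    the exact lag used in the post's backtest is a guess on my part.
    """
    ma = index_close.rolling(window).mean()
    return (index_close > ma).shift(1, fill_value=False)

# toy check with a short window: risk-on during the ramp up,
# risk-off once the series rolls over
px = pd.Series(np.concatenate([np.linspace(1, 2, 10), np.linspace(2, 1, 10)]))
sig = regime_on(px, window=3)
```

In a walk-forward backtest you would then gate the factor portfolio's daily returns with this series, e.g. `ls_ret * sig` (where `ls_ret` is a hypothetical name for the long-short return stream), sitting in cash whenever the filter is off.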

Practical advice if you're starting a quant project with AI agents: Budget $40/month for CC Pro + Codex Plus. Use Opus for initial architecture, then switch to Codex or Opus Think+Sonnet for implementation. Don't fight with free alternatives — the time savings pay for themselves within a day.

Happy to discuss the factor research side too if there's interest. Running the sim portfolio now and will share updates.
