u/plainnaan

I used Claude Code (Opus 4.7 xhigh) for a code review, then used Codex (ChatGPT 5.5 xhigh) to verify the findings. The result was... humbling.

Claude found a bunch of issues that looked plausible at first glance: security concerns, TypeScript problems, code quality issues, etc. But when I went through them with Codex and forced each claim to be verified against the actual codebase, many fell apart: either they were baseless, or they had the wrong root cause and Claude suggested the wrong, shortsighted remedy.

Afterwards I asked Claude to rate its own bug-finding/analysis performance in that conversation. Its answer was surprisingly blunt:

> so how do you rate your bug finding/analysis skills based on this conversation?

● Direct answer: mediocre, with a high false-positive rate that you carried the cost of.

The hard numbers from this conversation:
┌──────────────┬────────────────────────────┬──────────────────┐
│     Pass     │ Findings reviewed/resolved │ Wrong as written │
├──────────────┼────────────────────────────┼──────────────────┤
│ Security     │ ~11                        │ 7                │
├──────────────┼────────────────────────────┼──────────────────┤
│ Code Quality │ ~4 (so far)                │ 2                │
├──────────────┼────────────────────────────┼──────────────────┤
│ TypeScript   │ ~6                         │ 3                │
├──────────────┼────────────────────────────┼──────────────────┤
│ Total        │ ~21                        │ 12 (~57%)        │
└──────────────┴────────────────────────────┴──────────────────┘

Plus two real bugs the implementer discovered during fix-writing that my findings missed entirely (the unscoped try/catch swallowing policy denials, and the Request own-accessor / toString TOCTOU spoofing). So even counting "found the area" generously, my batting average on specific framings was below 50%.

What I did badly

The error modes were depressingly consistent. Every wrong finding could have been falsified in 30-60 seconds:

┌──────────────────────────────────────┬────────┬───────────────────────────────┐
│    Verification I should have run    │  Cost  │ Findings it would have caught │
├──────────────────────────────────────┼────────┼───────────────────────────────┤
│ grep "$dep" package.json             │ 5 sec  │ H3-CQ, H2-TS                  │
├──────────────────────────────────────┼────────┼───────────────────────────────┤
│ Read the schema description / README │ 30 sec │ H2-sec, L2-sec, M2-TS         │
├──────────────────────────────────────┼────────┼───────────────────────────────┤
│ Test in Node REPL                    │ 30 sec │ H1-sec, L3-sec, H4-CQ         │
├──────────────────────────────────────┼────────┼───────────────────────────────┤
│ Read the cited code line             │ 10 sec │ H2-TS (stale claim)           │
├──────────────────────────────────────┼────────┼───────────────────────────────┤
│ Check what's already in the codebase │ 60 sec │ M3-sec, several others        │
└──────────────────────────────────────┴────────┴───────────────────────────────┘

I didn't run any of these consistently. I drafted findings from pattern-matching against generic security/typing experience and from confident-sounding intuitions about runtime behavior. The cheap checks are exactly the kind of thing memory rule 6/8/10 now require — but they require running, not just remembering.

The pattern that bothers me most

I made the same shape of mistake repeatedly:
- Stated runtime/spec behavior with no verification: 4 times.
- Conflated similar-looking-but-distinct concerns: 3 times.
- Misread documented operator contracts as bugs: 3 times.

Each one is recoverable. The pattern of repetition suggests the underlying habit (confidently asserting verifiable claims without running the verification) is the actual problem, not any specific category of mistake.

So now I'm wondering: is this mainly a model limitation or a workflow/prompting problem?

My original prompt was maybe naive: "review this project from different perspectives in dedicated subagents and create a local markdown file with the findings."

But I would have expected basic verification to happen out of the box: checking cited lines, reading nearby code, testing simple runtime assumptions, checking package dependencies, and distinguishing "this looks suspicious" from "this is actually a bug." I don't think users should have to explicitly tell a code-review agent: "Please don't invent or overstate findings. Please verify cheap factual claims before reporting them."
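
To make "verify cheap factual claims" concrete, here is a rough sketch of two of those checks (cited line and package dependency) as a small Python gate that could run over findings before they are reported. This is not what Claude or Codex ran in the review above. The `Finding` fields and the function names are made up for illustration, and it assumes each finding records a file, a 1-based line number, and the code it claims is on that line.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Finding:
    """One review claim plus the evidence needed to check it (hypothetical schema)."""
    finding_id: str
    file: str                          # path relative to the repo root
    line: int                          # 1-based line number the finding cites
    quoted: str                        # code the finding says is on that line
    missing_dep: Optional[str] = None  # set if the finding claims a package is undeclared


def cited_line_matches(root: Path, finding: Finding) -> bool:
    """Return True if the cited line exists and contains the quoted code."""
    path = root / finding.file
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8").splitlines()
    if not 1 <= finding.line <= len(lines):
        return False
    return finding.quoted.strip() in lines[finding.line - 1]


def dependency_declared(root: Path, name: str) -> bool:
    """Return True if `name` appears in any dependency section of package.json."""
    manifest = root / "package.json"
    if not manifest.is_file():
        return False
    pkg = json.loads(manifest.read_text(encoding="utf-8"))
    sections = (
        "dependencies",
        "devDependencies",
        "peerDependencies",
        "optionalDependencies",
    )
    return any(name in pkg.get(section, {}) for section in sections)


def falsified(root: Path, finding: Finding) -> List[str]:
    """List the reasons a finding fails the cheap checks; empty means it survived."""
    reasons = []
    if not cited_line_matches(root, finding):
        reasons.append("cited line does not contain the quoted code")
    if finding.missing_dep and dependency_declared(root, finding.missing_dep):
        reasons.append(f"{finding.missing_dep} is declared in package.json")
    return reasons
```

A finding that comes back with a non-empty list would be dropped or downgraded to "looks suspicious" instead of being reported as a bug. A gate like this only catches stale citations and wrong dependency claims. It says nothing about whether the reasoning in a finding is right, so it would complement a second-model cross-check, not replace it.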

What is the recommended way to avoid this? Is there a prompting pattern, agent setup, or review mode that reliably reduces confident false positives? Or is adversarial cross-checking with another model currently just part of the cost of using LLMs for code review?

Many false positives in code review using Opus 4.7