Multi-agent AI code review: 17 of 18 findings false. Lessons from burning credits
I've been personally experimenting with multi-agent code review: separate specialist agents for security, implementation, testing, and architecture, running them on every PR I open. On one last month, 18 findings reported, only 1 survived manual verification.
The failure modes were consistent:
- Agents flagged files that weren't in the diff at all.
- "code does X" claims without quoting the offending line.
- Functions read in isolation, missing upstream guards.
- Pre-existing patterns flagged as PR-introduced.
What stuck: a triage step before dispatching specialists. It pulls the + line list from gh pr diff, builds a small context bundle (changed files, repo conventions, analogous code), and picks which specialists to run based on PR size and surface. Docs-only? Skip security and architecture. Multi-file change touching auth? All four. Specialists must then quote the offending line for any finding they raise; if they can't, it's dropped before reaching me.
I haven't wired this into CI yet, but the triage step would slot in cleanly as a pre-merge check.
If you're running AI review on real PRs, how are you bounding the false-positive rate? Per-agent diff scope, post-hoc filtering, or something else?