EDIT: A few of you have asked if I'd run this on your repo. I'm doing 5 free in May to refine the methodology (all run locally - I won't see your code). If you're debating models, harness, reasoning levels, AGENTS.md, or SKILL.md, edits, DM me with the decision you're trying to make, and we can go from there! Especially interested in organizations doing evaluations (as I am in one, and run into this problem frequently at work)

TLDR; OpenAI cooked with GPT-5.5

Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo.

I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools (these codebases were selected arbitrarily and may not represent your experience - that's the point of why running your own benchmarks is important!) Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.

The result was not "one model wins everything." GPT-5.5 was the best shipping default across these runs. By "shipping," I mean the model I would most often trust to produce a patch that passes tests, matches the intended human change, and survives code review. Opus 4.7 was still doing something valuable: it wrote much smaller patches.

On Zod, that looked like a real tradeoff. On graphql-go-tools, it looked more like under-implementation.

GPT-5.5 ships more often. Opus 4.7 ships smaller. Which one wins on your repo depends on whether your bottleneck is review or footprint.

That distinction is why repo-specific evals matter. Public benchmarks flatten model behavior into one number aggregated at massive scale. Real code turns it into a workflow decision on your specific codebase and standards.

I used Stet, an evaluation framework I am building for real-repo coding-agent benchmarks, to grade more than test pass/fail: behavioral equivalence to the human patch, code-review acceptability, footprint risk, and craft/discipline rubrics. This post is not a claim about all coding tasks. It is a concrete look at how three frontier models behaved on two real codebases.

Model	Harness	Reasoning Level
Opus 4.7	Claude Code	high
GPT-5.4	Codex CLI	high
GPT-5.5	Codex CLI	high

The short version

Across 56 scored tasks:

Metric	Opus 4.7	GPT-5.4	GPT-5.5
Tests pass	33/56	31/56	38/56
Equivalent to human patch	19/56	35/56	40/56
Clean pass: tests + review	10/56	11/56	28/56
Mean footprint risk, lower is better	0.20	0.34	0.32
Mean time/task	11m18s	8m24s	6m56s
Estimated run cost	$3.43	$2.39	$2.86

GPT-5.5 is the quality leader. It passes the most tests, matches the human patch most often, and clears the reviewer about three times as often as Opus.

Opus is the footprint leader. Its patches are smaller and lower-risk by Stet's footprint model. But a small patch is only good when it is complete. The recurring Opus failure mode is passing the visible tests while missing companion work the human PR included.

GPT-5.5 is also the efficiency leader on tokens and wall-clock. It used fewer input tokens, fewer output tokens, and less summed agent time than either competitor. GPT-5.4 is still the cost leader because its pricing is lower, but the cost advantage did not offset the clean-pass gap in these runs.

The repo split is where the result gets interesting:

Repo	Model	Tests	Equiv yes	Review pass	Clean pass
Zod, 27 scored tasks	Opus 4.7	12	11	6	5
Zod, 27 scored tasks	GPT-5.4	9	18	10	5
Zod, 27 scored tasks	GPT-5.5	12	18	14	10
graphql-go-tools, 29 tasks	Opus 4.7	21	8	5	5
graphql-go-tools, 29 tasks	GPT-5.4	22	17	6	6
graphql-go-tools, 29 tasks	GPT-5.5	26	22	19	18

On Zod, GPT-5.5 and Opus tie on tests. GPT-5.5 wins on reviewer judgment. Opus wins on diff size.

On graphql-go-tools, GPT-5.5 wins outright. It passes more tests, produces far more clean passes, and is closer to the human patch. Opus still writes the smallest patches, but the small-patch strategy misses too much.

Full scorecard

Metric	Opus 4.7	GPT-5.4	GPT-5.5
Code-review pass	11/56	16/56	33/56
Code-review avg: correctness + bug safety	2.33	2.59	3.08
- Correctness	2.11	2.60	3.16
- Introduced-bug safety	2.55	2.56	3.04
- Maintainability, GraphQL only	2.07	2.55	3.03
Custom grader avg, 8 rubrics	2.33	2.40	2.62
Craft score, 0-4	2.41	2.54	2.78
- Clarity / coherence / robustness	2.56 / 1.95 / 1.92	2.75 / 2.18 / 2.43	2.91 / 2.51 / 2.69
Discipline score, 0-4	2.20	2.16	2.36
- Scope discipline / diff minimality	2.39 / 2.42	2.18 / 2.28	2.45 / 2.46
Total input tokens	239.1M	222.3M	201.8M
Total output tokens	1.29M	1.09M	0.72M

The quality-score rows are there to avoid treating "more tests passed" as the whole story. Code review is one grader: correctness, introduced-bug risk, and maintainability where available. The custom grader average is separate: eight additive rubrics split into five craft dimensions and three discipline dimensions. Across both layers, GPT-5.5 is not merely preferred in the abstract. It is rated higher on correctness, lower introduced-bug risk, GraphQL maintainability, coherence, robustness, scope discipline, and diff minimality relative to the requested task. Opus still wins the mechanical footprint row, which is the useful tension: smaller diffs, but not consistently more disciplined diffs.

How the benchmark works

Each task is derived from a real merged commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch — running in its native shipped agent harness with no Stet-side scaffolding: Opus 4.7 in Claude Code (claude -p); GPT-5.5 and GPT-5.4 in OpenAI Codex CLI (codex exec); both at default settings. Stet applies the patch and runs the task's tests in an isolated container.

Then Stet grades the result beyond pass/fail:

Tests: did the patch satisfy the executable acceptance tests?
Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
Footprint risk: how much review and regression surface did the patch create?
Craft/discipline rubrics: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.

Every model ran once per task with a single seed. The judge model for equivalence and rubrics was GPT-5.4, run with identical rubric versions across all three arms. Each patch was scored independently — the judge sees the patch and the task, not the arm label or the model that produced it. There is no dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.

Tests are signal, not the finish line

The most useful row in the table is not tests. It is clean pass: tests pass and the code-review grader accepts the patch.

On Zod, Opus and GPT-5.5 both passed 12 of 27 scored tasks. If you stop there, the models look tied. But GPT-5.5 produced 10 clean passes; Opus produced 5.

On graphql-go-tools, the same pattern was amplified. GPT-5.5 passed 26 of 29 tests and produced 18 clean passes. Opus passed 21 tests but produced only 5 clean passes.

That is the gap you feel in code review. The tests say "this patch probably works." The reviewer asks "is this the patch we want to maintain?"

One GraphQL task shows the difference. PR #1001 changed an HTTP datasource OnFinished hook so consumers could inspect request and response metadata. All three models passed tests and were judged equivalent. Only GPT-5.5 cleared code review. The other two got warnings around API shape, raw HTTP object exposure, and robustness at the hook boundary.

That is not a benchmark trick, rather, this is reflective normal engineering culture where code is reviewed: three patches can satisfy the same test and still differ materially in review quality. You only want to merge the code that is high-quality and maintainable, even if it technically works.

What the reviewer saw

The code review and craft/discipline rows explain why the result is not reducible to "GPT-5.5 changes more files." Two patch autopsies make the numbers less abstract.

Zod async codecs and defaults. The task was to make codec pipelines work with async transforms, prevent defaults from becoming undefined, and generate stub package manifests for the build. All three models failed tests. If you stop at the test row, the task tells you nothing.

The reviewer found a real ordering underneath. Opus changed 8 files and missed central semantics: defaults could still allow undefined, core codec definitions remained synchronous, generated stubs were not published, and prefault() was tightened even though the request was about .default(). GPT-5.4 got closer with an 11-file patch and was judged behaviorally equivalent, but it still over-tightened adjacent API by restricting prefault. GPT-5.5 also failed tests, but it was judged equivalent and scored better on correctness and introduced-bug risk because it covered the schema/build behavior more cleanly: codec/default tests, version metadata, stub-manifest scripts, and the relevant packages/zod/src/v4/*/schemas.ts surfaces.

That is a different kind of signal from pass/fail. It says GPT-5.5 was not merely getting luckier tests; even on a miss, it more often moved the right pieces.

GraphQL Apollo-compatible validation. PR #1169 aligned field-selection validation errors with GraphQL spec and Apollo Router conventions. All three models produced patches. All three passed tests. Only GPT-5.5 cleared equivalence and review.

Opus touched 11 files and passed tests, but missed enum and wrapped-scalar leaf validation, pointed some leaf-selection locations at the field instead of the selection set, left an inline-fragment message non-spec-compliant, and did not apply validation status uniformly. GPT-5.4 touched 12 files and also passed tests, but broadened behavior in the wrong places: unconditional validation metadata, incomplete enum/wrapped scalar handling, broad request-error conversion, and stale compatibility API.

GPT-5.5 touched fewer files than either one, 10 total and 6 non-test, while still adding more targeted behavior: aligned field-selection messages, requested locations, and centralized Apollo validation metadata. This is the clean reviewer example: tests saw three passes; semantic grading saw one patch that actually matched the convention the PR was trying to establish.

This is what the score rows are trying to summarize. GPT-5.5's biggest review lead is correctness: 3.16 versus 2.60 for GPT-5.4 and 2.11 for Opus. The custom graders say the same thing from another angle: GPT-5.5 leads coherence and robustness because its patches more often carry the change through the repo's existing surfaces instead of stopping at the first passing path.

The discipline row is the one I would not overclaim. GPT-5.5 leads, but narrowly: 2.36 versus 2.20 for Opus and 2.16 for GPT-5.4. Opus wins raw footprint. GPT-5.5 narrowly wins task-relative discipline. The grader is separating "small" from "appropriately scoped." A patch can be compact and still undisciplined if it stops before the task is done.

What Opus is doing

Opus 4.7 is cautious. It writes smaller patches, touches fewer files, and has the lowest footprint risk in both repos.

On Zod, that caution is often attractive. Zod has many contained tasks where the correct move is a precise source edit, a type change, and maybe a small test update. Opus tied GPT-5.5 on tests while keeping the patch footprint lower.

But Opus's restraint has a recurring failure mode: it implements the headline behavior and stops before the companion work is done.

Zod made this easy to see. Zod has parallel Node and Deno trees. The tests exercise the main src/ path, so a patch can pass while leaving Deno mirrors stale. On several Opus test-pass-but-not-equivalent tasks, that is exactly what happened. A CIDR validation change passed tests after Opus touched four files. GPT-5.5 touched eleven, because it updated the parallel distribution surface too. The judge marked Opus non-equivalent because the human patch did the companion work.

The same behavior looked worse on graphql-go-tools. That repo is a Go federation engine with planner, datasource, hook, validation, and runtime paths that need to line up. A minimal patch is not enough if the real change spans several engine surfaces.

On PR #1155, the task covered repeated scalar fields in a gRPC datasource, request building, response marshaling, null and invalid responses, error status information, disabled datasources, and dynamically-created clients. Opus produced no patch. GPT-5.5 passed tests, matched the human patch, and cleared review.

That is the key distinction: Opus's small patches can be discipline on local tasks and under-implementation on integration-heavy tasks.

What changed from GPT-5.4 to GPT-5.5

GPT-5.5 is not just GPT-5.4 with higher pass rates. The failure modes shift.

GPT-5.4 often sees the right general approach but fails in execution. On Zod it had 18 equivalence yes judgments, matching GPT-5.5, but only 9 test passes. The equivalence grader recognized the intended behavior; executable validation still failed.

GPT-5.5 closes that gap. It keeps more of the broad integration behavior while producing fewer broken patches.

Three Zod examples are useful.

First, a schema-to-TypeScript generator. The task asked for a recursive visitor over Zod schema definitions. Opus and GPT-5.5 both recognized it as an implementation task and built the visitor. GPT-5.4 produced repository-instruction files instead of the feature. That is not a subtle algorithmic miss. It misclassified the work.

Second, a recursive parser fix. Both GPT models reached for visit-count tracking. GPT-5.4 added an inProgress sentinel and reset logic. GPT-5.5 kept the count-and-cache-error behavior and removed the extra state. Same broad idea, fewer moving parts, passing tests.

Third, CIDR validation. GPT-5.4 and GPT-5.5 had similar core algorithms: split on /, validate the address, validate the prefix. GPT-5.5 updated the Deno mirrors. GPT-5.4 did not. This is not a reasoning leap. It is repo hygiene.

On graphql-go-tools, the separation is more operational. PR #1232 required deduplicating identical single fetches while rewriting dependency references that pointed at removed duplicates. A patch can look plausible and still leave fetch dependencies stale. GPT-5.5 was the only model to pass tests, match the human behavior, and clear review.

The pattern is: GPT-5.5 does more of the boring integration work that turns a clever local fix into a shippable repo change.

The cost of doing more

GPT-5.5 writes larger patches than Opus.

On graphql-go-tools, average patch size was about 33 KB for GPT-5.5, 27 KB for GPT-5.4, and 19 KB for Opus. The footprint scores move accordingly: Opus 0.19, GPT-5.4 0.32, GPT-5.5 0.34.

That is not free. Bigger patches are harder to review, easier to conflict, and more likely to touch sensitive paths. If your workflow is dominated by auditability, Opus still has a real advantage.

But the craft rubric shows why raw size is not enough. On GraphQL, GPT-5.5 had the largest patches and still slightly led diff minimality relative to the task. The grader is not asking "who changed the fewest bytes?" It is asking "who changed the fewest bytes needed to solve the actual request?"

That distinction is the whole benchmark in miniature. A 5 KB patch that misses required surfaces is not more minimal than a 20 KB patch that finishes the job.

The cost story also changed between repos. On Zod, Opus and GPT-5.5 looked similar operationally: Opus used 53.0M input tokens and 359K output tokens; GPT-5.5 used 50.4M input and 290K output. Opus was faster on summed agent time, 1.99h versus 2.32h, and slightly cheaper, $45.53 versus $46.69.

GraphQL reversed that. Opus used 186.1M input tokens and 934K output tokens. GPT-5.5 used 151.4M input and 431K output. Opus took 8.56h of summed agent time; GPT-5.5 took 4.16h. That does not look like Opus sandbagging. It looks like Opus working longer, emitting more tokens, and still converging on smaller, less complete patches.

The behavior metrics point the same way. On GraphQL, Opus averaged 3.17 explicit planning calls per task; GPT-5.5 averaged zero. Opus made 10.2 patch calls per task; GPT-5.5 made 9.9. Opus was not bailing early. The difference was exploration style: GPT-5.5 made about twice as many shell calls and more search calls, while Opus spent more of its budget in planning and patch rewrite churn. In this repo, broader repo inspection appears to have mattered more than deliberating over a narrower patch.

Model personalities, in one paragraph each

Opus 4.7 — under-reach. Conservative, precise, low-footprint. Strong when the task is local and the desired change has a narrow surface. Weak when the human patch includes companion surfaces the tests do not fully cover. Its failure mode is often "tests pass, but this is not the same change."

GPT-5.4 — right shape, wrong execution. Directionally capable but uneven. It often finds the intended shape, which is why its equivalence numbers are respectable, but it is more prone to stale mirrors, extra bookkeeping, unearned refactors, and patches that the judge likes more than the test suite does.

GPT-5.5 — broader, bigger footprint. More complete on integration surface. It is more likely to update the surrounding code, pass review, and convert intended behavior into passing code. Its risk is patch footprint: when it is wrong, it can be wrong over more files.

Why this matters

The practical question is not "which model is best?"

The practical question is:

For this repo, under this harness, on the kinds of tasks we actually ship, which model produces patches we trust?

The answer changed by repo.

Zod made GPT-5.5 versus Opus look like a tradeoff: same test pass count, GPT-5.5 better reviewer alignment, Opus smaller patches.

graphql-go-tools made the tradeoff less symmetrical: GPT-5.5 was simply more shippable on the measured tasks, while Opus's small-patch advantage came with too much missed integration work.

That is why Stet is built around real repo tasks instead of synthetic prompts. Your repo has its own mirror trees, codegen surfaces, test blind spots, hook conventions, planner invariants, and review standards. You also have your own AGENTS.md, skills, model and harness settings, etc. Those details decide whether a model's "personality" is an asset or a liability.

Caveats

Fifty-six scored tasks is still small. One task swing moves a repo-level rate by a few points. Every model ran once per task. Some close calls would flip on rerun.

The equivalence and rubric judge was GPT-5.4. That can introduce family bias. I do not think it explains the whole result: GPT-5.5 beats GPT-5.4 decisively, Opus still wins footprint, and many Opus equivalence losses are concrete missed files or missing companion surfaces.

Results are also harness-conditional. Claude Code and Codex CLI bring different system prompts, planning loops, and tool surfaces, and each model ran in the harness its vendor ships. Running Opus 4.7 inside Codex via API, or GPT-5.5 inside Claude Code, would change the picture. The numbers here describe these models in the harnesses real engineers actually use them in — not the models in isolation.

Takeaway

If I had to summarize the 56 scored tasks:

GPT-5.5 is the best default shipping model across these two repos.
Opus 4.7 is still the low-footprint model and can be preferable when narrow diffs matter most.
GPT-5.4 is cheaper per task, but not enough better on cost to overcome the clean-pass gap here.
Tests alone would have hidden the most important result.
The same model ranking changed by repo, which is the point.

The interesting model eval is no longer "can the model solve a hard prompt?" It is "what kind of patch does this model tend to produce in my codebase, and does that match how my team ships software?"

r/codex

How do you guys code without seeing the code?

Got rid of 3 files and 5.5 became a beast again

Google use Codex

reset incoming

Plus reset as well

Another reset guys !

It seems GPT 5.5 regained its intellectual prowess

Whats the longest codex has taken to complete a task?

Antigravity engineers use Codex ??

Limit reset just happened, Did you get it?

Have any Plus users received their reset?

uncle TIbo No reset for plus user yet

The short version

Full scorecard

How the benchmark works

Tests are signal, not the finish line

What the reviewer saw

What Opus is doing

What changed from GPT-5.4 to GPT-5.5

The cost of doing more

Model personalities, in one paragraph each

Why this matters

Caveats

Takeaway

some of you need to see this

I commend and applaud the community here for not being so toxic and blind when a product gets degraded

Thanks Tibo

codex is a piece of shit the last few days

Codex burnout

5.5 is dumb

Codex usage draining 2-3x faster afer reset