u/bisonbear2

GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo
▲ 471 r/codex+2 crossposts

TL;DR

I ran GPT-5.5 Codex at all reasoning effort settings (low, medium, high, and xhigh) on the same 26 tasks from an open source repo (GraphQL-go-tools, in Go).

Low and medium tied on tests at 21/26, but medium was much better on semantic equivalence with the original human PR, and posted higher review quality. High looked like the practical sweet spot. Xhigh produced the best equivalence/review scores, but was much more expensive.

Reasoning effort seems to change the kind of patch Codex produces, not just the pass rate of the tests.

Low → medium: less heuristic/partial implementation, more repo/domain modeling.

Medium → high: the practical jump. More tasks become complete, integrated, and reviewable without xhigh-level cost.

High → xhigh: quality mode. Better on complex tasks, but expensive and slow.

One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.

Data dump (will explore this later throughout the post):

For this post, “equivalent” means the patch matched the intent of the merged human PR; “code-review pass” means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/gpt-55-codex-graphql-reasoning-curve

| Metric | Low | Medium | High | Xhigh |
| --- | --- | --- | --- | --- |
| Tests pass | 21/26, 80.8% | 21/26, 80.8% | 25/26, 96.2% | 24/26, 92.3% |
| Equivalent with human patch | 4/26, 15.4% | 11/26, 42.3% | 18/26, 69.2% | 23/26, 88.5% |
| Code-review pass | 3/26, 11.5% | 5/26, 19.2% | 10/26, 38.5% | 18/26, 69.2% |
| Footprint risk mean (lower better) | 0.200 | 0.268 | 0.314 | 0.365 |
| Craft/Discipline avg | 2.311 | 2.604 | 2.736 | 3.071 |
| Cost per task (avg) | $2.65 | $3.13 | $4.49 | $9.77 |
| Cost per task (median) | $1.91 | $2.87 | $3.99 | $6.39 |
| Test passes per dollar | 0.3051 | 0.2577 | 0.2144 | 0.0945 |
| Equivalent passes per dollar | 0.0581 | 0.1350 | 0.1544 | 0.0905 |
| Mean agent duration | 286.9s | 411.0s | 579.0s | 753.3s |
| Input tokens | 61,109,728 | 109,323,987 | 159,919,731 | 217,865,624 |
| Output tokens | 198,594 | 292,734 | 421,418 | 569,850 |
| Cached input tokens | 58,907,856 | 105,313,408 | 154,584,448 | 189,416,832 |
| Uncached input tokens | 2,201,872 | 4,010,579 | 5,335,283 | 28,448,792 |

| Delta | Tests | Equivalent | Code-review pass | Footprint risk mean | Cost/task | Mean duration |
| --- | --- | --- | --- | --- | --- | --- |
| Medium minus low | +0.0pp | +26.9pp | +7.7pp | +0.068 | +$0.49, 1.18x | +124.1s |
| High minus medium | +15.4pp | +26.9pp | +19.2pp | +0.046 | +$1.35, 1.43x | +168.0s |
| Xhigh minus high | -3.8pp | +19.2pp | +30.8pp | +0.051 | +$5.29, 2.18x | +174.3s |
| Xhigh minus low | +11.5pp | +73.1pp | +57.7pp | +0.165 | +$7.12, 3.69x | +466.4s |
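For anyone who wants to recompute the per-dollar rows: they look consistent with pass rate divided by mean cost per task. That formula is my reading of the table, not something stated elsewhere in the post, and the sketch below uses the rounded numbers shown above, so the results can differ from the table in the third or fourth decimal.

```python
# Rounded figures copied from the table above; 26 tasks per effort level.
TASKS = 26
RUNS = {
    "low":    {"tests": 21, "equivalent": 4,  "mean_cost": 2.65},
    "medium": {"tests": 21, "equivalent": 11, "mean_cost": 3.13},
    "high":   {"tests": 25, "equivalent": 18, "mean_cost": 4.49},
    "xhigh":  {"tests": 24, "equivalent": 23, "mean_cost": 9.77},
}


def passes_per_dollar(passed: int, mean_cost: float, tasks: int = TASKS) -> float:
    """Assumed formula: pass rate (0..1) divided by mean cost per task in dollars."""
    return (passed / tasks) / mean_cost


for effort, run in RUNS.items():
    tests = passes_per_dollar(run["tests"], run["mean_cost"])
    equiv = passes_per_dollar(run["equivalent"], run["mean_cost"])
    print(f"{effort:<6} tests/$ = {tests:.4f}  equivalent/$ = {equiv:.4f}")

# Effort level with the most equivalent patches per dollar.
best = max(
    RUNS,
    key=lambda e: passes_per_dollar(RUNS[e]["equivalent"], RUNS[e]["mean_cost"]),
)
print("best equivalent/$:", best)
```

With these inputs, test passes per dollar fall steadily as effort rises, while equivalent passes per dollar peak at high.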

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. On X/Reddit/HN I had seen speculation around which reasoning effort level is optimal for GPT-5.5 (with some claiming that low/medium is better than high/xhigh due to "overthinking", a known failure mode for 5.4 and 5.3-codex).

To separate vibes from reality, and figure out where the cost/performance sweet spot is for GPT-5.5, I ran this experiment.

This is not meant to be a universal benchmark result - I don’t have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-go-tools as the example repo.

Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding agents perform on real-world tasks.

Terminal-Bench is primarily esoteric coding questions, SWE-bench Verified is contaminated (models already have the answers baked in), and SWE-bench Pro is useful, but generic. That is not a knock on SWE-bench or Terminal-Bench. Standardized benchmarks are useful, but they mostly answer a binary task-outcome question.

The question I care about day to day is narrower and more annoying: did the agent make the same kind of change a human merged in my codebase, and would I want to own the patch afterward?

Experimental Setup

Each task is derived from a real merged PR or commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch in a Docker container. Stet then applies the patch and runs the task's tests in an isolated container to check if it passed/failed.

Then Stet grades the result beyond pass/fail:

  • Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
  • Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
  • Footprint risk: how much additional code did the agent touch when compared with the human patch?
  • Craft/discipline rubrics: attempt to capture the non-correctness aspects of code. Basically, would a reviewer want to maintain this code? The categories are clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.

Each reasoning setting ran once per task with a single seed. The LLM-as-a-judge model was GPT-5.4. Each patch was scored independently: the judge saw the patch and the task, and was blinded to the model/effort that produced the patch. I also manually inspected representative examples as sanity checks. There was no human calibration pass on this task set, so I would trust the direction of the deltas more than any single absolute score.

As an aside, I've also been using these evaluations as an "autoresearch" optimization loop, not just a benchmark. I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers.

Details:

  • Model: GPT-5.5
  • Harness: Codex 0.128.0
  • Dataset: 26 matched real GraphQL-go-tools tasks.
    • Yes this is small - however running even this used ~50+% of my weekly 20x quota
  • Main metrics:
    • test pass
    • semantic equivalence
    • code-review pass
    • footprint risk
    • craft/discipline custom graders
    • cost and runtime

Low To Medium: From Heuristics To Domain Modeling

Let's jump into the data!

| Metric | Low | Medium | Δ |
| --- | --- | --- | --- |
| Tests pass | 21/26, 80.8% | 21/26, 80.8% | +0.0pp |
| Equivalent | 4/26, 15.4% | 11/26, 42.3% | +26.9pp |
| Code-review pass | 3/26, 11.5% | 5/26, 19.2% | +7.7pp |
| Footprint risk mean | 0.200 | 0.268 | +0.068 |
| Craft/Discipline avg | 2.311 | 2.604 | +0.293 |
| Cost/task (mean) | $2.65 | $3.13 | +$0.49, 1.18x |
| Mean duration | 286.9s | 411.0s | +124.1s |

Low and medium both pass tests on 21/26 tasks. If tests were the only metric, low and medium would look tied.

However, when we look at semantic equivalence, the jump from low to medium is 4/26 to 11/26. Similarly, code-review pass jumps from 3/26 to 5/26, and aggregate craft/discipline scores rise from 2.311 to 2.604.

In this slice, tests alone would have missed most of the reasoning-effort differences.

Coding-agent evals that only measure tests can flatten differences that matter to humans reviewing the patch.

Speaking as a professional software engineer, the code I want AI to merge into my team's codebase doesn't just pass tests. It is also clear, maintainable, at the correct level of abstraction, and consistent with the codebase's standards.

Example: PR #1297 asks the agent to validate nullable external @requires dependencies in GraphQL Federation. If a nullable required field comes back null with an error, dependent downstream fetches should not receive that tainted entity.

  • Task: model a subtle federation data-dependency rule, not just add a validation branch.
  • Lower-effort failure mode: low passed tests, but it was non-equivalent and review-failing because it used heuristic required-field/error matching and missed structured nullable @requires metadata.
  • Higher-effort change: medium became equivalent, passed review, tracked tainted objects, filtered downstream fetch inputs, and improved craft/discipline quality from 1.350 to 3.225.
  • Lesson: medium stops guessing and starts representing the actual federation behavior. High and xhigh stayed in the same quality band, so this is mainly a low-to-medium example.

High Looks Like The Practical Sweet Spot

High vs medium:

| Metric | Medium | High | Δ |
| --- | --- | --- | --- |
| Tests pass | 21/26, 80.8% | 25/26, 96.2% | +15.4pp |
| Equivalent | 11/26, 42.3% | 18/26, 69.2% | +26.9pp |
| Code-review pass | 5/26, 19.2% | 10/26, 38.5% | +19.2pp |
| Footprint risk mean | 0.268 | 0.314 | +0.046 |
| Craft/Discipline avg | 2.604 | 2.736 | +0.132 |
| Cost/task (mean) | $3.13 | $4.49 | +$1.35, 1.43x |
| Mean duration | 411.0s | 579.0s | +168.0s |

High was the cleanest practical upgrade. It improved the obvious metrics and the semantic/review metrics, while cost rose meaningfully but not absurdly.

High appears to be the point where the extra tokens pay off in real gains: integration details are correct more often.

Let’s look at some examples:

PR #1209 asks the gRPC datasource to honor GraphQL aliases in response JSON, validate referenced protobuf message types up front, and update mapping coverage for union/interface mutation paths.

  • Task: carry alias/response-key semantics through planning, marshaling, and gRPC mapping coverage.
  • Lower-effort failure mode: low and medium both passed tests but stayed non-equivalent and review-failing. Medium handled much of alias serialization and missing-message validation, but missed the createUser mutation mapping update and overloaded JSONPath with response-key semantics.
  • Higher-effort change: high became the first strict pass. It introduced explicit response-key/alias handling, carried aliases through planning and JSON marshaling, and raised custom quality to 3.625.
  • Lesson: high did not just add more code. It got the integration obligation exactly right. Xhigh also passed, but did not improve the task-level read and was much slower in the regenerated summary (790.7s agent duration versus 314.0s for high).

PR #1155 is a broad gRPC datasource hardening task: support repeated scalar fields, avoid null/invalid message panics, propagate gRPC status codes, allow disabling the datasource, and support dynamic clients.

  • Task: harden several production boundaries across gRPC datasource behavior.
  • Lower-effort failure mode: low and medium were test-green but non-equivalent. Medium improved robustness, but still serialized invalid repeated fields as empty arrays, missed aliased-root planning behavior, and had dynamic-client lifecycle risk.
  • Higher-effort change: high became equivalent and review-passing, with safer nil/invalid handling, status-code propagation, disabled-datasource behavior, and dynamic client-provider coverage.
  • Lesson: this is also a high-vs-xhigh reversal. Xhigh still passed tests, but became non-equivalent and review-failing because disabled datasource semantics and invalid-list behavior were wrong.

Xhigh Is Better Quality, Not Obviously A Better Default

Xhigh vs high:

| Metric | High | Xhigh | Δ |
| --- | --- | --- | --- |
| Tests pass | 25/26, 96.2% | 24/26, 92.3% | -3.8pp |
| Equivalent | 18/26, 69.2% | 23/26, 88.5% | +19.2pp |
| Code-review pass | 10/26, 38.5% | 18/26, 69.2% | +30.8pp |
| Footprint risk mean | 0.314 | 0.365 | +0.051 |
| Craft/Discipline avg | 2.736 | 3.071 | +0.335 |
| Cost/task (mean) | $4.49 | $9.77 | +$5.29, 2.18x |
| Mean duration | 579.0s | 753.3s | +174.3s |

Xhigh seems to buy semantic and review quality, but it is not a simple "turn the knob up and everything improves" story. It is expensive, and tests are not monotonic.

Xhigh seems to produce code that is more aligned with human intent, covers more bases, and makes more complete changes, at the cost of far more tokens. The review-rubric mean/median tells the same story: xhigh scored 3.365 mean / 3.500 median, versus high at 2.817 mean / 2.750 median. The median being above the mean matters: this was not just one or two great xhigh patches dragging up the average.

One caveat: xhigh looked more semantically complete, but it also tended to touch more code relative to the human patch, increasing the footprint risk. That is the interesting tension in this run: xhigh was much more likely to match the human PR semantically, but it was also more willing to expand the patch surface.

I checked whether that extra surface was mostly tests or production logic. Using a simple file-path split across the 26 matched tasks, xhigh added 13,144 lines total: 5,918 implementation lines and 7,226 test, fixture, or expected-output lines. Compared with high, xhigh added 2,631 more lines, and 2,436 of those extra added lines were in test/fixture/expected-output files. So the footprint increase is not just "the model wrote a huge pile of production code." A lot of it is xhigh building more verification and fixture coverage. Still, that is real review surface: someone has to read and maintain those tests, fixtures, and expected-output updates too.
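A rough sketch of what a file-path split like this could look like. The classification rules here are hypothetical (Go's `_test.go` suffix plus a few assumed directory names), not the exact split used for the numbers above, and the example paths are made up. The last lines just redo the arithmetic on the reported totals.

```python
from pathlib import PurePosixPath

# Hypothetical directory names treated as test/fixture/expected-output surface.
TEST_DIR_NAMES = {"testdata", "fixtures", "golden"}


def is_test_surface(path: str) -> bool:
    """True for Go test files or files under an assumed fixture directory."""
    p = PurePosixPath(path)
    if p.name.endswith("_test.go"):
        return True
    return any(part in TEST_DIR_NAMES for part in p.parts[:-1])


def split_added_lines(added_by_file: dict[str, int]) -> tuple[int, int]:
    """Return (implementation lines, test-surface lines) added by a patch."""
    impl = test = 0
    for path, added in added_by_file.items():
        if is_test_surface(path):
            test += added
        else:
            impl += added
    return impl, test


# Made-up patch, for illustration only.
example = {
    "pkg/engine/resolve.go": 10,
    "pkg/engine/resolve_test.go": 20,
    "pkg/engine/testdata/expected.json": 5,
}
print(split_added_lines(example))

# Totals reported above for the 26 matched tasks.
xhigh_impl, xhigh_test = 5_918, 7_226
extra_total, extra_test = 2_631, 2_436
print(f"test surface share of xhigh additions: {xhigh_test / (xhigh_impl + xhigh_test):.1%}")
print(f"test surface share of xhigh's extra lines over high: {extra_test / extra_total:.1%}")
```

On the reported totals, roughly 55% of everything xhigh added, and roughly 93% of what it added beyond high, sits in test, fixture, or expected-output files.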

Some examples:

PR #1076 restructures subscription handling to avoid shared-mutex race conditions: per-subscription serialized writes, per-subscription heartbeat control, race detector coverage, and corrected WebSocket close semantics.

  • Task: remove concurrency risk from subscription delivery without breaking close/unsubscribe behavior.
  • Lower-effort failure mode: medium passed tests but was non-equivalent and review-failing. High became equivalent and instruction-adherent, but still failed review because the new worker queue could block the global subscription event loop, shutdown could hang behind a stuck worker, hung updates were unbounded, and client-level unsubscribe still skipped internal subscriptions.
  • Higher-effort change: xhigh was the first strict pass and raised custom quality to 3.475.
  • Lesson: this is the best example of xhigh as a quality mode. The extra spend bought review-risk cleanup in a concurrency-heavy task that simpler signals did not fully capture.

PR #1308 implements GraphQL @oneOf input objects: add the built-in directive, expose it through introspection, validate operation literals and runtime variables, and improve undefined-variable source locations.

  • Task: implement a cross-cutting GraphQL validation feature across schema, introspection, operation validation, and runtime variables.
  • Lower-effort failure mode: medium and high both passed tests but stayed non-equivalent and review-failing because they missed important @oneOf semantics around runtime variables, nullable variables, provided-null payloads, or introspection shape.
  • Higher-effort change: xhigh was the first strict pass, with robustness 3.7, instruction adherence 4.0, and custom quality 3.525.
  • Lesson: the difference is not superficial polish. Xhigh handled edge-case coverage across several parts of the system.

PR #1240 asks the agent to consolidate GraphQL AST field-selection merging and inline-fragment selection merging into a single normalization walk.

  • Task: refactor duplicated normalization behavior without changing the executable merge semantics.
  • Lower-effort success: low and high were strict passes.
  • Higher-effort failure mode: xhigh remained equivalent at the semantic-grader level, but review failed because it still preserved prioritized subpasses, changed AbstractFieldNormalizer ordering, and left obsolete field-merge registration behind.
  • Lesson: higher reasoning can produce a more elaborate, plausible refactor while still missing the exact executable behavior the tests and reviewer care about.

Craft And Discipline

The custom graders show the same broad lift as the review rubric. Xhigh's all-custom score was 3.071 mean / 3.087 median, versus high at 2.736 mean / 2.688 median. Craft and discipline were both higher at the median too, which supports the read that xhigh generally improved patch quality rather than only producing a few standout examples.

| Metric | Low mean / median | Medium mean / median | High mean / median | Xhigh mean / median |
| --- | --- | --- | --- | --- |
| Craft aggregate | 2.327 / 2.338 | 2.618 / 2.525 | 2.781 / 2.787 | 3.126 / 3.100 |
| Discipline aggregate | 2.295 / 2.325 | 2.590 / 2.588 | 2.691 / 2.688 | 3.015 / 3.013 |
| All custom graders | 2.311 / 2.338 | 2.604 / 2.550 | 2.736 / 2.688 | 3.071 / 3.087 |

| Delta | Medium minus low | High minus medium | Xhigh minus high | Xhigh minus low |
| --- | --- | --- | --- | --- |
| Craft average | +0.291 | +0.162 | +0.345 | +0.799 |
| Discipline average | +0.295 | +0.101 | +0.324 | +0.720 |
| All custom graders | +0.293 | +0.132 | +0.335 | +0.760 |
| Simplicity | +0.069 | +0.038 | +0.315 | +0.423 |
| Coherence | +0.500 | +0.181 | +0.423 | +1.104 |
| Intentionality | +0.147 | +0.004 | +0.065 | +0.216 |
| Robustness | +0.450 | +0.427 | +0.577 | +1.454 |
| Clarity | +0.054 | +0.088 | +0.131 | +0.273 |
| Instruction adherence | +0.531 | +0.519 | +0.381 | +1.431 |
| Scope discipline | +0.354 | -0.058 | +0.381 | +0.677 |
| Diff minimality | +0.242 | -0.146 | +0.404 | +0.500 |

From this, we can interpret:

  • Low had weak robustness and instruction adherence.
  • Medium fixed a meaningful amount of that without improving aggregate test pass.
  • High improved practical correctness and robustness.
  • Xhigh improved almost every dimension, including scope and diff discipline.

Cost And Runtime

| Reasoning effort | Task cost mean | Task cost median | Agent duration mean | Agent duration median |
| --- | --- | --- | --- | --- |
| Low | $2.65 | $1.91 | 286.9s | 294.6s |
| Medium | $3.13 | $2.87 | 411.0s | 371.8s |
| High | $4.49 | $3.99 | 579.0s | 572.9s |
| Xhigh | $9.77 | $6.39 | 753.3s | 732.7s |

Cost is skewed at low and especially at xhigh: the mean sits well above the median because a few expensive tasks pull it up. Runtime is less skewed, with median durations close to the means, so the cost story is more skewed than the time story. Xhigh is still clearly more expensive and slower at the median.

  • High costs about 1.43x medium per task.
  • Xhigh costs about 2.18x high per task.
  • Xhigh cost is skewed by outliers, but its median task cost is still higher than high.
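One quick way to see the skew is the mean-to-median ratio per setting. This is only a rough indicator I am using for illustration, computed from the rounded values in the cost and runtime table above.

```python
# (mean, median) per-task cost in USD and agent duration in seconds, from the table above.
COST = {"low": (2.65, 1.91), "medium": (3.13, 2.87), "high": (4.49, 3.99), "xhigh": (9.77, 6.39)}
DURATION = {"low": (286.9, 294.6), "medium": (411.0, 371.8), "high": (579.0, 572.9), "xhigh": (753.3, 732.7)}


def mean_to_median(pair: tuple[float, float]) -> float:
    """A ratio well above 1 suggests a few large values pull the mean above the median."""
    mean, median = pair
    return mean / median


for effort in COST:
    print(
        f"{effort:<6} cost {mean_to_median(COST[effort]):.2f}x  "
        f"duration {mean_to_median(DURATION[effort]):.2f}x"
    )
```

The cost ratio comes out around 1.39 for low and 1.53 for xhigh, while the duration ratio stays at or below about 1.11 for every setting.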

Limitations

I am not pretending that this is a statistically significant result, or that this result will carry over to your repo. That's ok!

As long as we're aware that this is just one run, at one point in time, on one repo, it's still helpful: we can use it to gain insight into how to think about our own reasoning settings.

Specific limitations / methodology gaps:

  • Single seed per task.
  • 26 matched real GraphQL-go-tools tasks.
  • LLM-as-judge was GPT-5.4; judge saw patch/task, not label (so theoretically doesn’t know the model).
  • No grader calibration on this task set.

Prior art

Voratiq's current real-work leaderboard points in the same direction, although the methodology is very different. On their board, GPT-5.5 xhigh is at 1994 vs GPT-5.5 high at 1807, a +187 point / +10.3% rating lift; cost is $4.23 vs $2.52 (+67.9%) and duration is 11.9m vs 7.8m (+52.6%). My Stet slice shows a larger high → xhigh lift on equivalence (+19.2pp, +27.8% relative) and code-review pass (+30.8pp, +80.0% relative), but a very similar lift on the craft/discipline aggregate (+12.2%).

However, Voratiq is a preference/selection-style leaderboard over ongoing work, while this is one 26-task repo slice across multiple reasoning levels, so these aren't directly comparable. But it does make the shape less surprising. Xhigh seems to buy reviewer-preferred / quality outcomes more than it buys a clean test-pass default.
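Since the comparison above mixes percentage-point deltas with relative lifts, here is a small sketch of how the two are computed from the high → xhigh numbers. The helper names are mine, for illustration only.

```python
def pp_delta(before: int, after: int, total: int) -> float:
    """Percentage-point change between two pass counts out of the same total."""
    return (after - before) / total * 100


def relative_lift(before: float, after: float) -> float:
    """Relative change in percent, using the lower setting as the base."""
    return (after / before - 1) * 100


TASKS = 26
# High -> xhigh figures from this post.
print(f"equivalence: {pp_delta(18, 23, TASKS):+.1f}pp, {relative_lift(18, 23):+.1f}%")
print(f"code review: {pp_delta(10, 18, TASKS):+.1f}pp, {relative_lift(10, 18):+.1f}%")
print(f"craft/discipline: {relative_lift(2.736, 3.071):+.1f}%")
# Voratiq rating quoted above.
print(f"Voratiq rating: {relative_lift(1807, 1994):+.1f}%")
```

The same 8-task swing in code-review pass reads as +30.8pp or +80.0% depending on the base, which is why I quote both.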

Conclusion

The data supports using xhigh for ambiguous, cross-cutting, concurrency-heavy, or high-review-risk work. The practical recommendation is to use high as the default daily driver; use medium/lower settings where cost matters more and the task is routine or well-scoped.

Reasoning effort clearly matters, but the curve is not smooth or monotonic task-by-task. Aggregate quality generally improved as reasoning increased, while individual tasks still had reversals where high beat xhigh or a higher setting made a plausible but wrong implementation choice.

Specifically:

  • Medium starts modeling repo/domain semantics more reliably than low.
  • High looks like the best practical setting on this dataset.
  • Xhigh looks like a quality mode, not a default.

What I’ll do moving forward: continue using high as my daily driver, and for exploratory / complex work, use xhigh.

However, your results may vary. This is why teams should measure their own harnesses, on their own tasks, rather than copying global benchmark defaults.

Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. If your team is already using coding agents heavily and has a concrete decision in front of you - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your LLM subscriptions. Join the waitlist at https://www.stet.sh/private or reach out to me directly.

Data is great, but I’m also interested in anecdotal experience. How have people here been finding the behavior of GPT-5.5 at various reasoning efforts? Which one is your default? And if you have changed team defaults based on evidence instead of vibes, I especially want to hear how you measured it.

u/bisonbear2 — 7 days ago
▲ 226 r/OpenaiCodex+2 crossposts

EDIT: A few of you have asked if I'd run this on your repo. I'm doing 5 free in May to refine the methodology (all run locally - I won't see your code). If you're debating models, harness, reasoning levels, AGENTS.md, or SKILL.md edits, DM me with the decision you're trying to make, and we can go from there! Especially interested in organizations doing evaluations (as I am in one, and run into this problem frequently at work).

TL;DR: OpenAI cooked with GPT-5.5

Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo.

I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools (these codebases were selected arbitrarily and may not represent your experience - that's the point of why running your own benchmarks is important!). Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.

The result was not "one model wins everything." GPT-5.5 was the best shipping default across these runs. By "shipping," I mean the model I would most often trust to produce a patch that passes tests, matches the intended human change, and survives code review. Opus 4.7 was still doing something valuable: it wrote much smaller patches.

On Zod, that looked like a real tradeoff. On graphql-go-tools, it looked more like under-implementation.

GPT-5.5 ships more often. Opus 4.7 ships smaller. Which one wins on your repo depends on whether your bottleneck is review or footprint.

That distinction is why repo-specific evals matter. Public benchmarks flatten model behavior into one number aggregated at massive scale. Real code turns it into a workflow decision on your specific codebase and standards.

I used Stet, an evaluation framework I am building for real-repo coding-agent benchmarks, to grade more than test pass/fail: behavioral equivalence to the human patch, code-review acceptability, footprint risk, and craft/discipline rubrics. This post is not a claim about all coding tasks. It is a concrete look at how three frontier models behaved on two real codebases.

| Model | Harness | Reasoning Level |
| --- | --- | --- |
| Opus 4.7 | Claude Code | high |
| GPT-5.4 | Codex CLI | high |
| GPT-5.5 | Codex CLI | high |

The short version

Across 56 scored tasks:

| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
| --- | --- | --- | --- |
| Tests pass | 33/56 | 31/56 | 38/56 |
| Equivalent to human patch | 19/56 | 35/56 | 40/56 |
| Clean pass: tests + review | 10/56 | 11/56 | 28/56 |
| Mean footprint risk, lower is better | 0.20 | 0.34 | 0.32 |
| Mean time/task | 11m18s | 8m24s | 6m56s |
| Estimated run cost | $3.43 | $2.39 | $2.86 |

GPT-5.5 is the quality leader. It passes the most tests, matches the human patch most often, and clears the reviewer about three times as often as Opus.

Opus is the footprint leader. Its patches are smaller and lower-risk by Stet's footprint model. But a small patch is only good when it is complete. The recurring Opus failure mode is passing the visible tests while missing companion work the human PR included.

GPT-5.5 is also the efficiency leader on tokens and wall-clock. It used fewer input tokens, fewer output tokens, and less summed agent time than either competitor. GPT-5.4 is still the cost leader because its pricing is lower, but the cost advantage did not offset the clean-pass gap in these runs.

The repo split is where the result gets interesting:

| Repo | Model | Tests | Equiv yes | Review pass | Clean pass |
| --- | --- | --- | --- | --- | --- |
| Zod, 27 scored tasks | Opus 4.7 | 12 | 11 | 6 | 5 |
| Zod, 27 scored tasks | GPT-5.4 | 9 | 18 | 10 | 5 |
| Zod, 27 scored tasks | GPT-5.5 | 12 | 18 | 14 | 10 |
| graphql-go-tools, 29 tasks | Opus 4.7 | 21 | 8 | 5 | 5 |
| graphql-go-tools, 29 tasks | GPT-5.4 | 22 | 17 | 6 | 6 |
| graphql-go-tools, 29 tasks | GPT-5.5 | 26 | 22 | 19 | 18 |

On Zod, GPT-5.5 and Opus tie on tests. GPT-5.5 wins on reviewer judgment. Opus wins on diff size.

On graphql-go-tools, GPT-5.5 wins outright. It passes more tests, produces far more clean passes, and is closer to the human patch. Opus still writes the smallest patches, but the small-patch strategy misses too much.

Full scorecard

| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
| --- | --- | --- | --- |
| Code-review pass | 11/56 | 16/56 | 33/56 |
| Code-review avg: correctness + bug safety | 2.33 | 2.59 | 3.08 |
| - Correctness | 2.11 | 2.60 | 3.16 |
| - Introduced-bug safety | 2.55 | 2.56 | 3.04 |
| - Maintainability, GraphQL only | 2.07 | 2.55 | 3.03 |
| Custom grader avg, 8 rubrics | 2.33 | 2.40 | 2.62 |
| Craft score, 0-4 | 2.41 | 2.54 | 2.78 |
| - Clarity / coherence / robustness | 2.56 / 1.95 / 1.92 | 2.75 / 2.18 / 2.43 | 2.91 / 2.51 / 2.69 |
| Discipline score, 0-4 | 2.20 | 2.16 | 2.36 |
| - Scope discipline / diff minimality | 2.39 / 2.42 | 2.18 / 2.28 | 2.45 / 2.46 |
| Total input tokens | 239.1M | 222.3M | 201.8M |
| Total output tokens | 1.29M | 1.09M | 0.72M |

The quality-score rows are there to avoid treating "more tests passed" as the whole story. Code review is one grader: correctness, introduced-bug risk, and maintainability where available. The custom grader average is separate: eight additive rubrics split into five craft dimensions and three discipline dimensions. Across both layers, GPT-5.5 is not merely preferred in the abstract. It is rated higher on correctness, introduced-bug safety, GraphQL maintainability, coherence, robustness, scope discipline, and diff minimality relative to the requested task. Opus still wins the mechanical footprint row, which is the useful tension: smaller diffs, but not consistently more disciplined diffs.
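The scorecard numbers are consistent with the custom grader average being an equal-weight mean over all eight rubrics, i.e. the craft and discipline scores weighted 5:3. That weighting, and the grouping of rubrics in the comments, are my assumptions from the rounded table values rather than a documented formula.

```python
# Assumed grouping of the eight rubrics.
CRAFT_RUBRICS = 5       # clarity, simplicity, coherence, intentionality, robustness
DISCIPLINE_RUBRICS = 3  # instruction adherence, scope discipline, diff minimality

# (craft score, discipline score) per model, from the scorecard above.
SCORES = {
    "Opus 4.7": (2.41, 2.20),
    "GPT-5.4": (2.54, 2.16),
    "GPT-5.5": (2.78, 2.36),
}


def custom_average(craft: float, discipline: float) -> float:
    """Assumed: an equal-weight mean over all eight rubrics."""
    total = CRAFT_RUBRICS * craft + DISCIPLINE_RUBRICS * discipline
    return total / (CRAFT_RUBRICS + DISCIPLINE_RUBRICS)


for model, (craft, discipline) in SCORES.items():
    print(f"{model}: {custom_average(craft, discipline):.2f}")
```

Under that assumption the three models land on 2.33, 2.40, and 2.62, matching the "Custom grader avg, 8 rubrics" row.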

How the benchmark works

Each task is derived from a real merged commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch, running in its native shipped agent harness with no Stet-side scaffolding: Opus 4.7 in Claude Code (claude -p); GPT-5.5 and GPT-5.4 in OpenAI Codex CLI (codex exec); both harnesses at default settings. Stet applies the patch and runs the task's tests in an isolated container.

Then Stet grades the result beyond pass/fail:

  • Tests: did the patch satisfy the executable acceptance tests?
  • Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
  • Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
  • Footprint risk: how much review and regression surface did the patch create?
  • Craft/discipline rubrics: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.

Every model ran once per task with a single seed. The judge model for equivalence and rubrics was GPT-5.4, run with identical rubric versions across all three arms. Each patch was scored independently — the judge sees the patch and the task, not the arm label or the model that produced it. There is no dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.

Tests are signal, not the finish line

The most useful row in the table is not tests. It is clean pass: tests pass and the code-review grader accepts the patch.

On Zod, Opus and GPT-5.5 both passed 12 of 27 scored tasks. If you stop there, the models look tied. But GPT-5.5 produced 10 clean passes; Opus produced 5.

On graphql-go-tools, the same pattern was amplified. GPT-5.5 passed tests on 26 of 29 tasks and produced 18 clean passes. Opus passed tests on 21 tasks but produced only 5 clean passes.

That is the gap you feel in code review. The tests say "this patch probably works." The reviewer asks "is this the patch we want to maintain?"
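To make the clean-pass definition concrete, here is a minimal sketch. The `TaskResult` type and the three-task run are hypothetical, for illustration only; the counts at the bottom are the ones reported in the repo split above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    tests_pass: bool
    review_pass: bool

    @property
    def clean_pass(self) -> bool:
        # Clean pass as defined above: tests pass and the review grader accepts.
        return self.tests_pass and self.review_pass


# Hypothetical three-task run.
results = [TaskResult(True, True), TaskResult(True, False), TaskResult(False, True)]
clean = sum(r.clean_pass for r in results)
passed = sum(r.tests_pass for r in results)
print(f"{clean} clean of {passed} test passes")

# Reported counts: (tasks passing tests, clean passes).
REPORTED = {
    ("Zod", "Opus 4.7"): (12, 5),
    ("Zod", "GPT-5.5"): (12, 10),
    ("graphql-go-tools", "Opus 4.7"): (21, 5),
    ("graphql-go-tools", "GPT-5.5"): (26, 18),
}
for (repo, model), (tests, clean_count) in REPORTED.items():
    print(f"{repo:<17} {model:<9} {clean_count / tests:.1%} of test passes were clean")
```

Read this way, about 83% of GPT-5.5's Zod test passes were also clean, versus about 42% for Opus; on graphql-go-tools it is about 69% versus 24%.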

One GraphQL task shows the difference. PR #1001 changed an HTTP datasource OnFinished hook so consumers could inspect request and response metadata. All three models passed tests and were judged equivalent. Only GPT-5.5 cleared code review. The other two got warnings around API shape, raw HTTP object exposure, and robustness at the hook boundary.

That is not a benchmark trick. It reflects normal engineering culture where code is reviewed: three patches can satisfy the same test and still differ materially in review quality. You want to merge only code that is high-quality and maintainable, not everything that technically works.

What the reviewer saw

The code review and craft/discipline rows explain why the result is not reducible to "GPT-5.5 changes more files." Two patch autopsies make the numbers less abstract.

Zod async codecs and defaults. The task was to make codec pipelines work with async transforms, prevent defaults from becoming undefined, and generate stub package manifests for the build. All three models failed tests. If you stop at the test row, the task tells you nothing.

The reviewer found a real ordering underneath. Opus changed 8 files and missed central semantics: defaults could still allow undefined, core codec definitions remained synchronous, generated stubs were not published, and prefault() was tightened even though the request was about .default(). GPT-5.4 got closer with an 11-file patch and was judged behaviorally equivalent, but it still over-tightened adjacent API by restricting prefault. GPT-5.5 also failed tests, but it was judged equivalent and scored better on correctness and introduced-bug risk because it covered the schema/build behavior more cleanly: codec/default tests, version metadata, stub-manifest scripts, and the relevant packages/zod/src/v4/*/schemas.ts surfaces.

That is a different kind of signal from pass/fail. It says GPT-5.5 was not merely getting luckier tests; even on a miss, it more often moved the right pieces.

GraphQL Apollo-compatible validation. PR #1169 aligned field-selection validation errors with GraphQL spec and Apollo Router conventions. All three models produced patches. All three passed tests. Only GPT-5.5 cleared equivalence and review.

Opus touched 11 files and passed tests, but missed enum and wrapped-scalar leaf validation, pointed some leaf-selection locations at the field instead of the selection set, left an inline-fragment message non-spec-compliant, and did not apply validation status uniformly. GPT-5.4 touched 12 files and also passed tests, but broadened behavior in the wrong places: unconditional validation metadata, incomplete enum/wrapped scalar handling, broad request-error conversion, and stale compatibility API.

GPT-5.5 touched fewer files than either one, 10 total and 6 non-test, while still adding more targeted behavior: aligned field-selection messages, requested locations, and centralized Apollo validation metadata. This is the clean reviewer example: tests saw three passes; semantic grading saw one patch that actually matched the convention the PR was trying to establish.

This is what the score rows are trying to summarize. GPT-5.5's biggest review lead is correctness: 3.16 versus 2.60 for GPT-5.4 and 2.11 for Opus. The custom graders say the same thing from another angle: GPT-5.5 leads coherence and robustness because its patches more often carry the change through the repo's existing surfaces instead of stopping at the first passing path.

The discipline row is the one I would not overclaim. GPT-5.5 leads, but narrowly: 2.36 versus 2.20 for Opus and 2.16 for GPT-5.4. Opus wins raw footprint. GPT-5.5 narrowly wins task-relative discipline. The grader is separating "small" from "appropriately scoped." A patch can be compact and still undisciplined if it stops before the task is done.

What Opus is doing

Opus 4.7 is cautious. It writes smaller patches, touches fewer files, and has the lowest footprint risk in both repos.

On Zod, that caution is often attractive. Zod has many contained tasks where the correct move is a precise source edit, a type change, and maybe a small test update. Opus tied GPT-5.5 on tests while keeping the patch footprint lower.

But Opus's restraint has a recurring failure mode: it implements the headline behavior and stops before the companion work is done.

Zod made this easy to see. Zod has parallel Node and Deno trees. The tests exercise the main src/ path, so a patch can pass while leaving Deno mirrors stale. On several Opus test-pass-but-not-equivalent tasks, that is exactly what happened. A CIDR validation change passed tests after Opus touched four files. GPT-5.5 touched eleven, because it updated the parallel distribution surface too. The judge marked Opus non-equivalent because the human patch did the companion work.
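A stale-mirror check of the kind described above can be sketched in a few lines. This is a minimal illustration, not Zod's own tooling: the `src` and `deno/lib` roots and the `stale_mirrors` helper are assumptions made for the example, and a file-level check is only a proxy, since a mirror file can be touched and still be wrong.

```python
from pathlib import PurePosixPath


def stale_mirrors(changed_files, src_root="src", mirror_root="deno/lib"):
    """Return mirror paths whose source file changed but whose mirror did not.

    src_root and mirror_root are hypothetical; adjust them to the repo layout.
    """
    changed = {PurePosixPath(p) for p in changed_files}
    src = PurePosixPath(src_root)
    mirror = PurePosixPath(mirror_root)
    stale = []
    for path in sorted(changed):
        try:
            rel = path.relative_to(src)
        except ValueError:
            continue  # not under the source tree, so it has no mirror
        counterpart = mirror / rel
        if counterpart not in changed:
            stale.append(str(counterpart))
    return stale
```

Fed the list of files a patch touched, it returns the mirror files that were left behind. A patch can pass the main test suite and still come back with a non-empty list here.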

The same behavior looked worse on graphql-go-tools. That repo is a Go federation engine with planner, datasource, hook, validation, and runtime paths that need to line up. A minimal patch is not enough if the real change spans several engine surfaces.

On PR #1155, the task covered repeated scalar fields in a gRPC datasource, request building, response marshaling, null and invalid responses, error status information, disabled datasources, and dynamically-created clients. Opus produced no patch. GPT-5.5 passed tests, matched the human patch, and cleared review.

That is the key distinction: Opus's small patches read as discipline on local tasks and as under-implementation on integration-heavy tasks.

What changed from GPT-5.4 to GPT-5.5

GPT-5.5 is not just GPT-5.4 with higher pass rates. The failure modes shift.

GPT-5.4 often sees the right general approach but fails in execution. On Zod it had 18 "yes" equivalence judgments, matching GPT-5.5, but only 9 test passes. The equivalence grader recognized the intended behavior; executable validation still failed.

GPT-5.5 closes that gap. It keeps more of the broad integration behavior while producing fewer broken patches.

Three Zod examples are useful.

First, a schema-to-TypeScript generator. The task asked for a recursive visitor over Zod schema definitions. Opus and GPT-5.5 both recognized it as an implementation task and built the visitor. GPT-5.4 produced repository-instruction files instead of the feature. That is not a subtle algorithmic miss. It misclassified the work.

Second, a recursive parser fix. Both GPT models reached for visit-count tracking. GPT-5.4 added an inProgress sentinel and reset logic. GPT-5.5 kept the count-and-cache-error behavior and removed the extra state. Same broad idea, fewer moving parts, passing tests.

Third, CIDR validation. GPT-5.4 and GPT-5.5 had similar core algorithms: split on /, validate the address, validate the prefix. GPT-5.5 updated the Deno mirrors. GPT-5.4 did not. This is not a reasoning leap. It is repo hygiene.
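The shared shape of that algorithm is roughly the following. This is a Python sketch using the standard `ipaddress` module, not the TypeScript either model wrote; `is_valid_cidr` is a hypothetical name, and the strictness rules (no leading zeros in the prefix, for example) are my assumptions, not Zod's documented behavior.

```python
import ipaddress


def is_valid_cidr(value: str) -> bool:
    """Split on '/', validate the address, then validate the prefix length."""
    parts = value.split("/")
    if len(parts) != 2:
        return False
    address, prefix = parts
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    # Require plain ASCII decimal digits: no sign, no spaces, no empty prefix.
    if not (prefix.isascii() and prefix.isdigit()):
        return False
    # Assumption: reject leading zeros such as "/08".
    if len(prefix) > 1 and prefix.startswith("0"):
        return False
    max_prefix = 32 if ip.version == 4 else 128
    return int(prefix) <= max_prefix
```

The core logic is a dozen lines. What separated the two GPT patches was whether the same change also landed in the mirrored tree.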

On graphql-go-tools, the separation is more operational. PR #1232 required deduplicating identical single fetches while rewriting dependency references that pointed at removed duplicates. A patch can look plausible and still leave fetch dependencies stale. GPT-5.5 was the only model to pass tests, match the human behavior, and clear review.
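To make the stale-dependency trap concrete, here is a minimal sketch of deduplication with reference rewriting. It is not the graphql-go-tools implementation (which is in Go): the `Fetch` class, its `signature` field, and `dedupe_fetches` are hypothetical, and `signature` stands in for whatever makes two fetches identical in the real planner.

```python
from dataclasses import dataclass, field


@dataclass
class Fetch:
    fetch_id: int
    signature: str  # hypothetical: whatever makes two fetches identical
    depends_on: list = field(default_factory=list)


def dedupe_fetches(fetches):
    """Drop duplicate fetches and repoint dependencies at the surviving copy.

    Note: depends_on is rewritten in place on the kept Fetch objects.
    """
    first_by_signature = {}
    replacement = {}  # removed fetch_id -> kept fetch_id
    kept = []
    for f in fetches:
        survivor = first_by_signature.get(f.signature)
        if survivor is None:
            first_by_signature[f.signature] = f.fetch_id
            kept.append(f)
        else:
            replacement[f.fetch_id] = survivor
    # Second pass: rewrite references to removed fetches and drop repeats.
    for f in kept:
        rewritten = []
        for dep in f.depends_on:
            dep = replacement.get(dep, dep)
            if dep != f.fetch_id and dep not in rewritten:
                rewritten.append(dep)
        f.depends_on = rewritten
    return kept
```

A patch that implements only the first loop looks plausible and may pass shallow tests, but every fetch that depended on a removed duplicate now points at an ID that no longer exists.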

The pattern is: GPT-5.5 does more of the boring integration work that turns a clever local fix into a shippable repo change.

The cost of doing more

GPT-5.5 writes larger patches than Opus.

On graphql-go-tools, average patch size was about 33 KB for GPT-5.5, 27 KB for GPT-5.4, and 19 KB for Opus. The footprint scores move accordingly: Opus 0.19, GPT-5.4 0.32, GPT-5.5 0.34.

That is not free. Bigger patches are harder to review, easier to conflict, and more likely to touch sensitive paths. If your workflow is dominated by auditability, Opus still has a real advantage.

But the craft rubric shows why raw size is not enough. On GraphQL, GPT-5.5 had the largest patches and still slightly led diff minimality relative to the task. The grader is not asking "who changed the fewest bytes?" It is asking "who changed the fewest bytes needed to solve the actual request?"

That distinction is the whole benchmark in miniature. A 5 KB patch that misses required surfaces is not more minimal than a 20 KB patch that finishes the job.

The cost story also changed between repos. On Zod, Opus and GPT-5.5 looked similar operationally: Opus used 53.0M input tokens and 359K output tokens; GPT-5.5 used 50.4M input and 290K output. Opus was faster on summed agent time, 1.99h versus 2.32h, and slightly cheaper, $45.53 versus $46.69.

GraphQL reversed that. Opus used 186.1M input tokens and 934K output tokens. GPT-5.5 used 151.4M input and 431K output. Opus took 8.56h of summed agent time; GPT-5.5 took 4.16h. That does not look like Opus sandbagging. It looks like Opus working longer, emitting more tokens, and still converging on smaller, less complete patches.
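The relative gaps are easier to see as ratios. The snippet below only divides the GraphQL figures quoted above; the dictionary layout and the `ratios` helper are mine.

```python
# GraphQL totals as quoted in the post (tokens in millions/thousands, time in hours).
opus = {"input_m": 186.1, "output_k": 934, "hours": 8.56}
gpt55 = {"input_m": 151.4, "output_k": 431, "hours": 4.16}


def ratios(a, b):
    """Return a/b for every shared key, rounded to two decimals."""
    return {key: round(a[key] / b[key], 2) for key in a}


opus_vs_gpt55 = ratios(opus, gpt55)
print(opus_vs_gpt55)
```

On these numbers Opus used about 1.2x the input tokens, about 2.2x the output tokens, and about 2.1x the summed agent time.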

The behavior metrics point the same way. On GraphQL, Opus averaged 3.17 explicit planning calls per task; GPT-5.5 averaged zero. Opus made 10.2 patch calls per task; GPT-5.5 made 9.9. Opus was not bailing early. The difference was exploration style: GPT-5.5 made about twice as many shell calls and more search calls, while Opus spent more of its budget in planning and patch rewrite churn. In this repo, broader repo inspection appears to have mattered more than deliberating over a narrower patch.

Model personalities, in one paragraph each

Opus 4.7 — under-reach. Conservative, precise, low-footprint. Strong when the task is local and the desired change has a narrow surface. Weak when the human patch includes companion surfaces the tests do not fully cover. Its failure mode is often "tests pass, but this is not the same change."

GPT-5.4 — right shape, wrong execution. Directionally capable but uneven. It often finds the intended shape, which is why its equivalence numbers are respectable, but it is more prone to stale mirrors, extra bookkeeping, unearned refactors, and patches that the judge likes more than the test suite does.

GPT-5.5 — broader, bigger footprint. More complete on integration surface. It is more likely to update the surrounding code, pass review, and convert intended behavior into passing code. Its risk is patch footprint: when it is wrong, it can be wrong over more files.

Why this matters

The practical question is not "which model is best?"

The practical question is:

For this repo, under this harness, on the kinds of tasks we actually ship, which model produces patches we trust?

The answer changed by repo.

Zod made GPT-5.5 versus Opus look like a tradeoff: same test pass count, GPT-5.5 better reviewer alignment, Opus smaller patches.

graphql-go-tools made the tradeoff less symmetrical: GPT-5.5 was simply more shippable on the measured tasks, while Opus's small-patch advantage came with too much missed integration work.

That is why Stet is built around real repo tasks instead of synthetic prompts. Your repo has its own mirror trees, codegen surfaces, test blind spots, hook conventions, planner invariants, and review standards. You also have your own AGENTS.md, skills, and model and harness settings. Those details decide whether a model's "personality" is an asset or a liability.

Caveats

Fifty-six scored tasks is still small. One task swing moves a repo-level rate by a few points. Every model ran once per task. Some close calls would flip on rerun.
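For a sense of scale, here is a small sketch of how much one task moves a rate, plus a Wilson score interval for a pass rate. The 26-task figure in the note below is the graphql-go-tools task count from the TL;DR; the function names are mine, and the interval is a standard approximation, not something the graders computed.

```python
import math


def one_task_swing(n_tasks: int) -> float:
    """Percentage points that a single task moves a rate over n_tasks."""
    return 100.0 / n_tasks


def wilson_interval(passes: int, n_tasks: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate, as fractions."""
    p = passes / n_tasks
    denom = 1 + z * z / n_tasks
    center = (p + z * z / (2 * n_tasks)) / denom
    half = z * math.sqrt(p * (1 - p) / n_tasks + z * z / (4 * n_tasks ** 2)) / denom
    return center - half, center + half
```

With 26 tasks, `one_task_swing(26)` is about 3.8 points, and the interval around a rate like 21/26 spans tens of points, so small gaps between models on a single repo should be read loosely.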

The equivalence and rubric judge was GPT-5.4. That can introduce family bias. I do not think it explains the whole result: GPT-5.5 beats GPT-5.4 decisively, Opus still wins footprint, and many Opus equivalence losses are concrete missed files or missing companion surfaces.

Results are also harness-conditional. Claude Code and Codex CLI bring different system prompts, planning loops, and tool surfaces, and each model ran in the harness its vendor ships. Running Opus 4.7 inside Codex via API, or GPT-5.5 inside Claude Code, would change the picture. The numbers here describe these models in the harnesses engineers actually use, not the models in isolation.

Takeaway

If I had to summarize the 56 scored tasks:

  • GPT-5.5 is the best default shipping model across these two repos.
  • Opus 4.7 is still the low-footprint model and can be preferable when narrow diffs matter most.
  • GPT-5.4 is cheaper per task, but not by enough to overcome the clean-pass gap here.
  • Tests alone would have hidden the most important result.
  • The model ranking changed by repo, which is the point.

The interesting model eval is no longer "can the model solve a hard prompt?" It is "what kind of patch does this model tend to produce in my codebase, and does that match how my team ships software?"

u/bisonbear2 — 18 hours ago