u/COAGULOPATH

GPT-5.5 and Opus 4.7 evaluated on ARC-AGI-3

Both models spent $10,000 (the limit). GPT-5.5 scored 0.4% and Opus 4.7 scored 0.2%.

This benchmark is quite difficult for clankers. It seems almost pointless to test current LLMs on it: they all score equally (about zero). My prediction of a 30% score in a year seems unlikely to come true.

It's probable that new breakthroughs (or at least much better base models) are needed here. (That said, when LLMs finally do make a dent in ARC-AGI-3, even a small one, expect scores to shoot toward 100% quite fast.)

So far, so boring.

Less boring is the ARC Prize's analysis of how GPT-5.5 and Opus 4.7 played, based on reasoning from 160 games. The two models failed in strikingly different ways.

Opus 4.7 aggressively theorycrafts, and learns game mechanics fairly well. But it assumes facts not in evidence, struggles to integrate new data into existing beliefs, and often can't (or won't) backtrack out of wrong assumptions. It ends up playing from a theory of the game that is "neat, plausible and wrong."

GPT-5.5 just...doesn't commit to a theory. Ever. It taps buttons but never seems to learn anything. Every turn, it sounds like an old man who has woken from a deep slumber and is seeing the game for the first time ("I'm analyzing a game with a grid..."). It blindly wonders if it's playing Tetris, or if the orange blocks are lava. Everything gets pattern-matched onto some existing videogame, with its previous reasoning forgotten.

It's funny that GPT-5.5 "doubles" Opus 4.7's score. To the extent this isn't noise, it's likely due to GPT-5.5's exploration-focused approach getting lucky a little more often.

tldr: Opus 4.7 is precise but inaccurate, GPT-5.5 accurate but imprecise.

Do tests like ARC-AGI-3 mean much, in the end? I'm not sure. I suspect the games were designed (in part) to focus on things that humans find easy and LLMs find hard, like spatial reasoning. But many important things (like robotics) involve spatial reasoning, so I see this as defensible.

(I got around 80% on the two games I played. According to its creator, "Any smart human giving it real effort should score >90% on ARC-AGI-3". y u bully me man :( )

arcprize.org
u/COAGULOPATH — 4 days ago

When LLMs choose from one of two options, they pick the first one ~63.3% of the time.

When those same options are presented in reverse order, the LLM's choice flips ~44.8% of the time.

If you are doing anything that involves LLMs grading or ranking things, this is important to be aware of. Some models are worse than others, with the GPT-5x line being egregiously bad.
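One standard mitigation is to query the judge in both orders and only trust verdicts that agree. A minimal sketch (the `judge` function below is a hypothetical stand-in for an LLM call, simulated here with a ~63% first-slot bias to match the numbers above):

```python
import random

def judge(option_a: str, option_b: str) -> str:
    """Hypothetical stand-in for an LLM judge call: returns "A" or "B".
    Simulated as a judge with pure primacy bias: it picks the first
    slot ~63.3% of the time, regardless of content."""
    return "A" if random.random() < 0.633 else "B"

def debiased_judge(x: str, y: str) -> str:
    """Query the judge with both orderings; only accept agreeing verdicts."""
    first = judge(x, y)   # x in the first slot
    second = judge(y, x)  # y in the first slot
    winner_1 = x if first == "A" else y
    winner_2 = y if second == "A" else x
    if winner_1 == winner_2:
        return winner_1
    return "tie"  # verdict flipped with order: treat as no real preference

random.seed(0)
results = [debiased_judge("resp1", "resp2") for _ in range(10_000)]
flip_rate = results.count("tie") / len(results)
print(f"order-sensitive verdicts: {flip_rate:.1%}")
```

With a purely position-biased judge like this simulated one, roughly half of all pairwise verdicts flip with order and get discarded, which is the point: a real judge's signal is whatever survives the swap.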

For a discussion of order bias in humans, see Holbrook et al, 2007.

Tl;dr, the human bias is smaller, and lies in the opposite direction. Humans have a recency bias: they prefer the second of two options. The authors think this might be because:

>When response options are presented orally, respondents cannot think much about the first option they hear, because presentation of the second option interrupts this thinking. Similar interference occurs until after the last alternative is heard, at which point that option is the most salient and most likely to be the focus of respondents’ thoughts. So confirmatory biased thinking and incomplete consideration of response options would yield recency effects.

Could LLM primacy bias be explained by the fact that every forward pass recomputes the activations of all past tokens in the sequence (a forward pass at step n+k re-processes token n), so earlier tokens get "introspected" on more in some way? The opposite of the oral process described above? But then there's sliding attention...
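The asymmetry being gestured at can be tallied in a toy model. Under naive, cache-free autoregressive decoding, generating step t re-runs the forward pass over tokens 0..t, so the earliest tokens are processed the most often (in practice KV caching avoids this recomputation, so take it as an illustration of the intuition, not of real inference):

```python
def recompute_counts(seq_len: int) -> list[int]:
    """Count how many forward passes touch each token position under
    naive (cache-free) autoregressive decoding of seq_len tokens."""
    counts = [0] * seq_len
    for step in range(seq_len):       # one forward pass per generated token
        for pos in range(step + 1):   # the pass covers all tokens so far
            counts[pos] += 1
    return counts

print(recompute_counts(5))  # earliest token touched most: [5, 4, 3, 2, 1]
```

Token 0 gets processed on every one of the n steps, while token n-1 gets processed exactly once, which is the lopsidedness the speculation rests on.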

Companies don't seem to be training to fix this, given the drastic deltas in how (otherwise fairly comparable) models like Opus 4.6 and GPT-5.4 perform.

u/COAGULOPATH — 20 days ago