
I was reading the comments on this post, and the overall opinion seemed to be that the harness makes little to no difference for ARC-AGI-3. It turns out it makes a huge difference: Hill-climbing ARC-AGI-3
TL;DR: if you save game logs (actions taken, board states, and scores) and let LLMs search over them with tools, they become only moderately less efficient than humans in terms of the number of actions needed to beat ARC-AGI-3 games.
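A minimal sketch of the kind of harness the TL;DR describes: each action is appended to a log along with the resulting board and score, and the agent gets a simple search tool over that log. The field names and the predicate-based search tool here are illustrative assumptions, not the authors' actual harness.

```python
# Hypothetical log-and-search harness sketch. Each step is stored as one JSON
# line; the search tool lets the model filter steps by an arbitrary condition.
import json

def log_step(log, step, action, board, score):
    """Append one game step as a JSON line (schema is illustrative)."""
    log.append(json.dumps(
        {"step": step, "action": action, "board": board, "score": score}))

def search_log(log, predicate):
    """Tool exposed to the agent: return step records matching a predicate."""
    return [rec for line in log if predicate(rec := json.loads(line))]
```

The point of keeping the log raw rather than abstracting it (per the quoted findings below) is that structured search over the full history stays tractable even past 100k lines.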
>Frontier LLMs struggle out of the box on this benchmark. In our preliminary tests, Opus 4.6 and GPT-5.2 failed to progress beyond Level 3 in any of the preview games (which have up to seven levels), even over a thousand-action horizon. In the ARC 2025 preview competition, leaderboard results were dominated by non-LLM exploration-based agents, which typically required 80k–100k+ actions to solve roughly half of the preview levels.
>Humans need around 900 actions to finish the preview games. We investigate how far minimal tooling can push LLM-based agents toward human baseline.
>We find diminishing (even negative) returns with additional hand-engineering, e.g., pre-built functions or memory abstraction modules. Structured search over raw game logs, even exceeding 100k lines, remains tractable and effective under our setup.
And if LLMs are allowed to use Python, they can even beat some games near-optimally.
>A favorite example of algorithmic planning is how our agent solved the last level of ft09 in the near-optimal number of actions. In level 6, clicking a cell toggles it and its four orthogonal neighbors (a classic Lights Out game mechanic). The agent recognizes this structure and constructs a linear system from scratch, solving it via Gaussian elimination to find the analytic 11-click solution (Fig 4b).
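The Lights Out solve in the quote can be sketched concretely. Pressing a cell toggles itself and its four orthogonal neighbors, so over GF(2) the effect of a set of presses is linear: if column j of a matrix A marks the cells toggled by pressing cell j, then clearing a board b means solving A x = b (mod 2). This is a generic sketch of that technique, not the agent's actual code; board dimensions and helper names are assumptions.

```python
# Solve Lights Out as a linear system over GF(2) via Gaussian elimination.
def solve_lights_out(board):
    """board: 2D list of 0/1. Returns (row, col) cells to press, or None."""
    n_rows, n_cols = len(board), len(board[0])
    n = n_rows * n_cols

    # Augmented matrix [A | b]: row i lists which presses toggle cell i,
    # and b is the current lit state we want to clear.
    aug = []
    for r in range(n_rows):
        for c in range(n_cols):
            row = [0] * (n + 1)
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    row[rr * n_cols + cc] = 1
            row[n] = board[r][c]
            aug.append(row)

    # Gaussian elimination mod 2: XOR rows instead of subtracting.
    pivot_row, pivots = 0, []
    for col in range(n):
        pr = next((i for i in range(pivot_row, n) if aug[i][col]), None)
        if pr is None:
            continue
        aug[pivot_row], aug[pr] = aug[pr], aug[pivot_row]
        for i in range(n):
            if i != pivot_row and aug[i][col]:
                aug[i] = [a ^ b for a, b in zip(aug[i], aug[pivot_row])]
        pivots.append(col)
        pivot_row += 1

    # Nonzero right-hand side below the pivots means no solution exists.
    if any(aug[i][n] for i in range(pivot_row, n)):
        return None

    presses = [0] * n
    for i, col in enumerate(pivots):
        presses[col] = aug[i][n]
    return [(j // n_cols, j % n_cols) for j in range(n) if presses[j]]
```

Because press order doesn't matter and pressing a cell twice cancels out, the solution set from the elimination is also minimal among solutions in its coset, which is why this yields an analytic click count rather than a search over action sequences.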