I added 26 new visual tasks to MindTrial, under the visual2 prefix.
These are grayscale image tasks at a somewhat higher resolution than the original visual set, covering OCR, spatial reasoning, numerical awareness, visual deduction, and pattern completion. All tested models had access to the same Python tool environment.
Because the merged leaderboard now mixes runs with different task counts, I’m focusing on percentages rather than raw totals (the sketch after the list below shows how I compute the pass rates and point deltas).
Old visual → New visual2 pass rate:
- GPT-5.5: 78.8% → 84.6% (+5.8 pts), runtime/task +50.9%
- Gemini 3.1 Pro: 63.6% → 84.6% (+21.0 pts), runtime/task -38.3%, 0 hard errors
- GPT-5.4: 66.7% → 73.1% (+6.4 pts), runtime/task +6.8%
- Claude 4.7 Opus: 51.5% → 65.4% (+13.9 pts), runtime/task -21.3%
- Kimi K2.6: 39.4% → 61.5% (+22.1 pts), runtime/task -13.8%
- Grok 4.20 Beta: 36.4% → 57.7% (+21.3 pts), runtime/task +178.1%
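To make the percentage comparison concrete, here is a minimal sketch of how the pass rates and point deltas above are derived. The helper names are mine, not part of MindTrial, and the example counts are hypothetical (only the 26-task size of visual2 comes from this post).

```python
# Minimal sketch: pass rate as a percentage, and the delta between two runs
# in percentage points. Helper names are illustrative, not MindTrial APIs.

def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of all attempted tasks."""
    return 100.0 * passed / total

def delta_pts(old_pct: float, new_pct: float) -> float:
    """Change between two runs, in percentage points."""
    return new_pct - old_pct

# Hypothetical counts: 17 of 30 old visual tasks vs. 20 of the 26 visual2 tasks.
old = pass_rate(17, 30)   # ~56.7%
new = pass_rate(20, 26)   # ~76.9%
print(f"{old:.1f}% -> {new:.1f}% ({delta_pts(old, new):+.1f} pts)")
```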
Main takeaway: GPT-5.5 and Gemini 3.1 Pro are basically co-leaders on this new visual slice.
GPT-5.5 had the better accuracy on completed tasks: 88.0% vs. Gemini’s 84.6%.
Gemini had the cleaner reliability profile: the same 84.6% pass rate as GPT-5.5, 0 hard errors, and a much better runtime than its old visual-task run.
Kimi K2.6 is also interesting: a big improvement and strong completed-task accuracy, but it is still held back by hard errors and long runtimes.
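On how "pass rate" and "accuracy on completed tasks" can diverge: as I read the reports, hard errors count against the pass rate but are dropped from the completed-task denominator. The sketch below uses hypothetical per-task counts (the class and field names are mine) to show how one hard error can produce an 88.0% completed-task accuracy alongside the same 84.6% pass rate as an error-free run.

```python
# Sketch of the two metrics as I interpret them: "pass rate" counts hard
# errors as failures; "accuracy on completed tasks" excludes them entirely.
# RunSummary and its fields are illustrative, not MindTrial internals.

from dataclasses import dataclass

@dataclass
class RunSummary:
    passed: int        # tasks answered correctly
    failed: int        # tasks completed but answered incorrectly
    hard_errors: int   # tasks that never produced an answer

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed + self.hard_errors
        return 100.0 * self.passed / total

    @property
    def completed_accuracy(self) -> float:
        completed = self.passed + self.failed
        return 100.0 * self.passed / completed

# Hypothetical counts: one hard error raises completed-task accuracy to 88.0%
# while the pass rate stays at the same 84.6% as an error-free run.
with_error = RunSummary(passed=22, failed=3, hard_errors=1)
error_free = RunSummary(passed=22, failed=4, hard_errors=0)
print(f"{with_error.pass_rate:.1f}% pass, {with_error.completed_accuracy:.1f}% on completed")
print(f"{error_free.pass_rate:.1f}% pass, {error_free.completed_accuracy:.1f}% on completed")
```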
Overall, visual2 seems to be doing what I hoped: OCR is now mostly solvable for top models, while spatial reasoning and visual pattern completion still separate the field.
Selected models on visual2 tasks: http://www.petmal.net/shared/mindtrial/results/2026-04-28/mindtrial-eval-selected-models-visual2-tasks-04-2026.html