u/Correct_Tomato1871

▲ 9 · r/gpt5 + 5 crossposts

I added 26 new visual tasks to MindTrial, under the visual2 prefix.

These are grayscale, somewhat higher-resolution image tasks covering OCR, spatial reasoning, numerical awareness, visual deduction, and pattern completion. All tested models had access to the same Python tool environment.

Because the merged leaderboard now includes models with different task counts, I’m focusing on percentages rather than raw totals.

Old visual → New visual2 pass rate:

  • GPT-5.5: 78.8% → 84.6% (+5.8 pts), runtime/task +50.9%
  • Gemini 3.1 Pro: 63.6% → 84.6% (+21.0 pts), runtime/task -38.3%, 0 hard errors
  • GPT-5.4: 66.7% → 73.1% (+6.4 pts), runtime/task +6.8%
  • Claude 4.7 Opus: 51.5% → 65.4% (+13.9 pts), runtime/task -21.3%
  • Kimi K2.6: 39.4% → 61.5% (+22.1 pts), runtime/task -13.8%
  • Grok 4.20 Beta: 36.4% → 57.7% (+21.3 pts), runtime/task +178.1%
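
For anyone who wants to reproduce the percentages above, a minimal sketch of the arithmetic, assuming the task counts quoted elsewhere in these posts (33 old visual tasks, 26 visual2 tasks). The 26/33 matches GPT-5.5's visual line in the full-benchmark post below; the 22/26 is just the count its quoted 84.6% implies, not a separately reported number.

    def pass_rate(passed, total):
        # Pass rate as a percentage of all tasks in the set.
        return 100.0 * passed / total

    old = pass_rate(26, 33)   # old visual set, 33 tasks -> 78.8%
    new = pass_rate(22, 26)   # visual2 set, 26 tasks   -> 84.6%
    print(f"{old:.1f}% -> {new:.1f}% ({new - old:+.1f} pts)")  # +5.8 pts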

Main takeaway: GPT-5.5 and Gemini 3.1 Pro are basically co-leaders on this new visual slice.

GPT-5.5 had the better accuracy on completed tasks: 88.0% vs. Gemini’s 84.6%.

Gemini had the cleaner reliability profile: same 84.6% pass rate, 0 hard errors, and much better runtime compared with its old visual-task run.
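
In case the pass rate vs. completed-task accuracy split isn't obvious: pass rate counts hard errors as failures, while accuracy only looks at tasks the model actually completed, which is also why Gemini's two numbers coincide at 84.6% with 0 hard errors. Rough sketch below; the 22-of-25 split is what GPT-5.5's quoted 84.6% / 88.0% imply on 26 tasks (i.e. one hard error), not something reported separately.

    # Assuming "completed" = no hard error; 22/25 is implied by
    # GPT-5.5's quoted 84.6% pass rate and 88.0% accuracy on 26 tasks.
    total_tasks = 26
    completed   = 25                           # 1 hard error assumed
    passed      = 22

    pass_rate = 100.0 * passed / total_tasks   # 84.6% (denominator: all tasks)
    accuracy  = 100.0 * passed / completed     # 88.0% (denominator: completed tasks)
    print(f"pass rate {pass_rate:.1f}%, accuracy {accuracy:.1f}%")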

Kimi K2.6 is also interesting: big improvement and strong completed-task accuracy, but still hurt by hard errors and long runtime.

Overall, visual2 seems to be doing what I hoped: OCR is now mostly solvable for top models, while spatial reasoning and visual pattern completion still separate the field.

Selected models on visual2 tasks: http://www.petmal.net/shared/mindtrial/results/2026-04-28/mindtrial-eval-selected-models-visual2-tasks-04-2026.html

petmal.net
u/Correct_Tomato1871 — 12 days ago
▲ 4 · r/gpt5 + 3 crossposts

Added 2 major models to my MindTrial leaderboard: OpenAI GPT-5.5 and DeepSeek V4 Pro.

GPT-5.5 takes the top full-benchmark spot in this run:

  • Overall: 64/72 passed, 88.9% pass rate, 94.1% accuracy
  • Text-only: 38/39
  • Visual: 26/33
  • Runtime: 1h 9m total, ~20.1s median per task

Compared with GPT-5.4, that is +3 overall passes, +4 visual passes, fewer hard errors, and a big speed jump: 3h 10m → 1h 9m.

It also used fewer Python calls: 247 → 133, with much lower median input/output tokens than GPT-5.4. So this looks less like brute-force tool exploration and more like restrained, efficient tool use.
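
Putting that efficiency gain in relative terms, using only the figures quoted above (times converted to minutes):

    runtime_old, runtime_new = 3 * 60 + 10, 1 * 60 + 9    # 190 min -> 69 min
    calls_old, calls_new     = 247, 133                    # Python tool calls

    print(f"runtime:    -{1 - runtime_new / runtime_old:.1%}")  # ~ -63.7%
    print(f"tool calls: -{1 - calls_new / calls_old:.1%}")      # ~ -46.2%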

One caveat: GPT-5.5 was run at high reasoning, not xhigh, following OpenAI’s GPT-5.5 guidance for hard reasoning tasks. It also had 4 hard errors, all invalid_prompt usage-policy flags on visual tasks — likely false positives, but still real benchmark reliability misses.
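
Those 4 hard errors are also what separates the two headline numbers above: assuming accuracy is computed only over tasks that didn't hit a hard error, everything lines up.

    total, passed, hard_errors = 72, 64, 4
    completed = total - hard_errors                        # 68 tasks completed

    print(f"pass rate: {100 * passed / total:.1f}%")       # 88.9%
    print(f"accuracy:  {100 * passed / completed:.1f}%")   # 94.1%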

DeepSeek V4 Pro also looks like a major text-only upgrade:

  • Text-only: 37/39
  • Visual: skipped
  • Hard errors: 0
  • Runtime: 2h 14m

Compared with DeepSeek-V3.2, it went from 32/39 to 37/39 on text tasks and eliminated 6 hard errors.

Main takeaway: GPT-5.5 is the new full MindTrial leader here — and notably fast for that score. DeepSeek V4 Pro is a strong and much cleaner text-only DeepSeek run, but not comparable as a full multimodal entrant in this setup.

petmal.net
u/Correct_Tomato1871 — 18 days ago

Added 3 new models to my MindTrial leaderboard:

Claude 4.7 Opus: 52/72 overall. Strongest of the new additions, but still behind GPT-5.4, GPT-5.2, Gemini 3.1 Pro, and Claude 4.6 on the current board.

Kimi K2.6: 50/72 overall, with 37/39 text and 13/33 visual @ 32k max-token cap. Better than the included K2.5 run at 42/72, but that K2.5 run used a 16k max-token cap. In an internal K2.5@32k rerun, K2.5 reached 47/72, so the gap shrank from 8 passes to 3. K2.6 also took over 9.5 hours, which is a big part of the story.

Xiaomi MiMo-V2.5: 31/72 overall, with 21/39 text and 10/33 visual. Better than MiMo-V2-Omni (29/72), mostly thanks to vision, but still nowhere near the top multimodal models.

Main takeaway: useful leaderboard movement, but more evolution than revolution this round.

petmal.net
u/Correct_Tomato1871 — 19 days ago