u/zero0_one1

Gemini 3.5 Flash scores 1479 on the Debate Benchmark. Ratings are Elo-like and centered near 1500.

Hundreds of topics, including dating apps, smartphones in schools, older-adult care, shrinkflation, and eurozone politics.

Each motion is debated twice, with PRO and CON roles reversed.

More info: https://github.com/lechmazur/debate

u/zero0_one1 — 5 hours ago

PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups.
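A minimal sketch of the clearing rule described above, for illustration only. The function name `clear_round` is hypothetical and not taken from the PACT repo; the only logic assumed is what the post states: a trade happens when bid ≥ ask, at the midpoint of the two.

```python
from typing import Optional


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price for one round, or None if no trade clears.

    Hypothetical helper: mirrors the rule stated in the post
    (bid >= ask clears at the midpoint), nothing more.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


# Example: buyer bids 60, seller asks 50 -> trade at the midpoint.
price = clear_round(60, 50)
no_trade = clear_round(40, 50)
```

How profit is scored from that price (private values, costs, and so on) is not covered here; see the repo for the actual rules.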

PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds.

More info, game logs, charts: https://github.com/lechmazur/pact

The top 5 are GPT-5.5, Opus 4.7, DeepSeek V4 Pro, Gemini 3.1 Pro, and Kimi K2.6.

Note that opponent mixes vary by model, and charts such as Average Profit by Round do not control for this.

Ratings are computed with Glicko-2 and displayed on an Elo-like scale, with new entries starting at 1500.

u/zero0_one1 — 9 days ago

The benchmark uses adversarial, multi-turn debates across 683 curated motions. Each model pair debates the same motion twice with sides swapped.

Scores are Bradley-Terry ratings over side-swapped matchups, reported on an Elo-like scale centered around 1500 for the comparison pool.
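A rough way to read gaps on an Elo-like scale, as a sketch only. It assumes the conventional Elo mapping (a 400-point gap corresponds to 10:1 odds); the post does not say which scale factor the benchmark uses, so the 400 here is an assumption, and `win_probability` is a hypothetical helper name.

```python
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A over B under the standard Elo/Bradley-Terry logistic.

    Assumption: the conventional 400-point scale. The benchmark's actual
    scale factor may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Equal ratings give even odds; a higher rating gives better than even odds.
even = win_probability(1500, 1500)
favored = win_probability(1600, 1500)
```

Under that assumption, equal ratings give 0.5, and the probabilities for A over B and B over A always sum to 1.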

The benchmark also tracks a judge-side entertainment diagnostic as a secondary signal.

Each completed debate is intended to be judged by a three-model panel. Mean cross-judge winner agreement on overlapping side-swapped matchups: 0.55.
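One plausible reading of "cross-judge winner agreement", sketched below: for each debate, take every pair of judges on the panel and check whether they picked the same winner, then average across debates. This is an assumption about the metric, not the repo's definition, and `mean_pairwise_agreement` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Sequence


def mean_pairwise_agreement(panels: Sequence[Sequence[str]]) -> float:
    """Average, over debates, of the share of judge pairs naming the same winner.

    Assumption: each inner sequence holds one verdict label per judge
    (for example "PRO" or "CON"). Debates with fewer than two verdicts
    are skipped.
    """
    rates: List[float] = []
    for verdicts in panels:
        pairs = list(combinations(verdicts, 2))
        if not pairs:
            continue
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates) if rates else float("nan")


# A unanimous panel scores 1.0; a 2-1 split scores 1/3 (one agreeing pair of three).
example = mean_pairwise_agreement([["PRO", "PRO", "PRO"], ["PRO", "CON", "CON"]])
```

With three judges, a 2-1 split contributes only 1/3 under this definition, so a mean well below 1.0 does not by itself imply that majority verdicts are unstable.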

More charts, transcripts, model profiles, existing qualitative writeup, reports, and raw judgments: https://github.com/lechmazur/debate

Qualitative writeups about newly added models are coming.

Opus 4.7 still leads at 1711 BT.

GPT-5.5 (high) enters at 1574, below GPT-5.4 (high) at 1625.

Grok 4.3 underperforms the older Grok 4.20 Beta 0309 reasoning run: 1512 → 1419.

GLM-5.1 improves over GLM-5: 1536 → 1573.

Kimi K2.6 improves over Kimi K2.5: 1520 → 1568.

Qwen 3.6 Max Preview scores 1535.

DeepSeek V4 Pro improves over DeepSeek V3.2: 1438 → 1517.

Xiaomi MiMo V2.5 Pro improves over Xiaomi MiMo V2 Pro: 1459 → 1553.

Mistral Medium 3.5 High Reasoning enters at 1412, ahead of Mistral Large 3 at 1299.

Tencent Hy3 Preview enters at 1481.

u/zero0_one1 — 15 days ago

GPT-5.5:
xhigh: 94.0→97.5
high: 93.6→96.9
medium: 92.0→95.0
no reasoning: 32.8→37.5

Kimi K2.6 improves over Kimi K2.5 (78.3→91.4) and becomes the #1 open-weights model.

DeepSeek V4 Pro improves over DeepSeek V3.2 (50.2→75.7).
DeepSeek V4 Flash scores 53.2.

Qwen 3.6 Max Preview scores 82.2 (Qwen 3.6 Plus scored 71.3).

Tencent Hy3 Preview scores 30.2.

Ling 2.6 1T (no reasoning) scores 10.8.

Previously:
Opus 4.7 (high) scores 41.0 on the Extended NYT Connections Benchmark. Opus 4.7 (no reasoning) scores 15.3. Opus 4.7 (high) refuses to answer 54% of the puzzles. On the subset of questions for which Opus 4.7 provided an answer, it scored 90.9% vs 94.7% for Opus 4.6.

More info: https://github.com/lechmazur/nyt-connections/

u/zero0_one1 — 22 days ago