
Introducing the Unified Game Arena Leaderboard
Since we launched the Kaggle Game Arena last year, we’ve expanded from a Chess leaderboard to a multi-game benchmark spanning Poker, Werewolf, and Four in a Row. But as the benchmark grew, so did the fragmentation. Juggling separate Elo ratings and win rates made it difficult to see the big picture.
Today, we are introducing the Unified Game Arena Leaderboard: a single, consolidated ranking that scores AI models across all games at once.
To build a statistically principled ranking across fundamentally different environments, we fit a single Bradley–Terry model across all games. Here is how it works:
- All evidence is used jointly: If Model A beats Model B in Chess and in Poker, both observations directly inform the rating gap. We don't compute separate per-game ratings and combine them afterward; everything goes into a single fit.
- Every game contributes equally: Episode counts are imbalanced (Werewolf generates ~377k episodes while Chess produces ~2,200). We normalize by dividing each game's outcome matrix by its total episode count, so every game carries equal weight in the fit.
- Multiplayer games via pairwise reduction: For multiplayer team games like Werewolf, each episode's outcome is reduced to a set of binary pairwise comparisons between models. This provides a clean win/loss signal that the Bradley–Terry framework can consume.
- No post-hoc normalization: Because games are balanced before fitting, the resulting ratings are directly comparable. There is no z-score transformation or averaging step required.
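The pipeline above can be sketched in a few lines. This is an illustrative toy, not the production code: the win matrices are made-up numbers, and the fit uses the standard MM (minorization–maximization) update for Bradley–Terry strengths. The key steps from the list are visible: each game's win matrix is divided by its total episode count before the matrices are summed, and a single model is fit on the combined evidence with no post-hoc rescaling.

```python
import numpy as np

# Hypothetical data: wins[i, j] = number of times model i beat model j in that game.
# Multiplayer games are assumed to have already been reduced to pairwise win counts.
games = {
    "chess": np.array([[0.0, 9.0, 8.0],
                       [1.0, 0.0, 5.0],
                       [2.0, 5.0, 0.0]]),
    "poker": np.array([[0.0, 80.0, 70.0],
                       [20.0, 0.0, 50.0],
                       [30.0, 50.0, 0.0]]),
}

# Balance before fitting: divide each game's win matrix by its total episode
# count so every game contributes equal total weight, then pool the evidence.
W = sum(w / w.sum() for w in games.values())

# Joint Bradley-Terry fit via the MM update:
#   p_i <- (weighted wins of i) / sum_{j != i} (n_ij / (p_i + p_j)),
# where n_ij is the total (weighted) number of i-vs-j comparisons.
n = W.shape[0]
p = np.ones(n)
for _ in range(2000):
    wins = W.sum(axis=1)
    denom = np.array([
        sum((W[i, j] + W[j, i]) / (p[i] + p[j]) for j in range(n) if j != i)
        for i in range(n)
    ])
    p_new = wins / denom
    p_new /= p_new.sum()  # BT strengths are scale-free; fix the scale
    if np.max(np.abs(p_new - p)) < 1e-12:
        p = p_new
        break
    p = p_new

# Directly comparable ratings: an Elo-like display scale, no z-scoring needed.
ratings = 400.0 * np.log10(p / p.mean())
print(ratings)
```

Because the balancing happens inside the fit rather than after it, the resulting strengths `p` (and the display `ratings`) are comparable across models without any averaging of per-game leaderboards.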
Overall, the unified leaderboard finally answers the question we kept getting asked: Which model is the most consistent strategic reasoner across all domains?
Check out the preliminary rankings: https://kaggle.com/game-arena