
AIVAT: Statistically Significant Win Rates with 1/10th the Sample Size

Yesterday GTO Wizard published a benchmark pitting the best LLMs against GTO Wizard AI.

Tom Dwan responded:

>This is cool. 5k obviously not enough hands though, you guys should know that. Can you run a new one with 50-100k hands

This reveals one of the most interesting parts of the project: luck-adjusted winrates.

Let me explain.

Poker players are conditioned to think you need 100k+ hands for meaningful results, but that's not always true.

  • If you know both players' complete strategies, you can calculate their winrates with zero variance (just like a solver)
  • If you only know one player's complete strategy (GTO Wizard in this case), you can still drastically reduce the variance. That enables us to get statistically significant match results with a fraction of the sample size (see the toy sketch below).
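Here's a toy sketch of that first bullet in Python (all numbers invented, nothing to do with the actual benchmark). With both strategies known, EV is an exact expectation; without them, you're stuck with noisy Monte Carlo sampling:

```python
import random

# Hypothetical terminal outcomes of a tiny toy game: (probability, payoff in bb)
outcomes = [(0.5, +2.0), (0.3, -1.0), (0.2, -3.5)]

# Both strategies known -> EV is an exact expectation, zero variance
exact_ev = sum(p * v for p, v in outcomes)

# Neither strategy known -> estimate EV by dealing hands, which is noisy
def sampled_ev(n_hands: int) -> float:
    probs, payoffs = zip(*outcomes)
    results = random.choices(payoffs, weights=probs, k=n_hands)
    return sum(results) / n_hands

print(f"exact EV:         {exact_ev:+.3f} bb/hand")  # deterministic
print(f"5k-hand estimate: {sampled_ev(5000):+.3f} bb/hand")  # varies run to run
```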

How Does It Work?

You probably already understand variance reduction as a concept. For example, all-in adjusted winrates are a common way to reduce variance, since we know each player's equity at the moment they went all in. But AIVAT goes way beyond that. Knowing half the strategy pair is enough for massive variance reduction.
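A minimal sketch of that all-in adjustment, assuming a simple two-player all-in with no side pots (the function and numbers are mine, not GTO Wizard's):

```python
# All-in EV adjustment (not AIVAT itself): when the money goes in, credit each
# player their equity share of the pot instead of the actual runout.

def all_in_adjusted_result(pot: float, invested: float, equity: float) -> float:
    """Hero's luck-adjusted winnings for a single all-in pot (in bb).

    pot      -- total pot size once all-in
    invested -- hero's contribution to that pot
    equity   -- hero's probability of winning at the moment of all-in
    """
    return equity * pot - invested

# Example: 100bb pot, hero invested 50bb with 80% equity.
# The raw result is +50 or -50 depending on the runout; the adjusted result
# is +30 either way, so suckouts and coolers stop adding noise.
print(all_in_adjusted_result(pot=100, invested=50, equity=0.80))  # 30.0
```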

For example, since we know GTO Wizard's entire range at showdown, we can evaluate hand vs range instead of noisy hand vs hand showdowns. That converges much faster: short-term results stop being dominated by coolers and reflect your true EV much sooner.
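Roughly what hand-vs-range scoring looks like, with a made-up range and equities:

```python
# Hypothetical showdown: villain's full range is known, so hero's winnings are
# averaged over every combo (weighted by reach probability) instead of scored
# against the single combo villain happened to hold. All numbers invented.

# (combo, reach probability, hero's equity vs that combo)
villain_range = [
    ("AhAd", 0.10, 0.08),
    ("KsQs", 0.55, 0.62),
    ("7c6c", 0.35, 0.91),
]

def hand_vs_range_winnings(pot: float, combos) -> float:
    """Hero's expected share of the pot against villain's whole range."""
    total = sum(w for _, w, _ in combos)
    return pot * sum(w * eq for _, w, eq in combos) / total

print(f"{hand_vs_range_winnings(100.0, villain_range):.1f} bb")  # 66.8 bb
```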

But that's only one piece of it. AIVAT applies several luck adjustments that build on the fact that one player's strategy is known. For example, it also accounts for card luck (how much the board helped or hurt the agent), as well as RNG luck (how lucky you were with respect to villain's mixed actions, e.g. maybe they rolled a low-frequency fold to a massive bluff).
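In control-variate terms it looks something like this. This is a loose paraphrase of the idea, not the exact estimator from the paper: each correction term is "observed minus expected" at a chance event or at one of the known player's mixed decisions, and because each term has expectation zero, subtracting them doesn't bias the result.

```python
# Rough control-variate sketch of the AIVAT idea (simplified; the real
# estimator is defined in the AIVAT paper). All numbers are hypothetical.

def aivat_style_adjust(raw_result: float,
                       corrections: list[tuple[float, float]]) -> float:
    """raw_result  -- hero's actual winnings for the hand (bb)
    corrections -- (observed_value, expected_value) pairs, one per chance
                   card dealt or per mixed action by the known-strategy player
    """
    luck = sum(observed - expected for observed, expected in corrections)
    return raw_result - luck

# Hypothetical hand: hero won 40bb, but the river card was worth +25bb of
# card luck and villain's low-frequency fold was worth +10bb of RNG luck.
adjusted = aivat_style_adjust(40.0, [(30.0, 5.0), (15.0, 5.0)])
print(adjusted)  # 5.0 -> hero's skill-based result was much smaller
```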

Versions of this technique have previously been used in landmark poker AI projects like DeepStack and Pluribus. The details go beyond what I can outline in a Reddit post, but they are fully explained in the literature. You can read more in the original AIVAT paper (Burch et al., 2018): https://arxiv.org/abs/1612.06915

Here's a look at how closely the luck-adjusted winnings track the raw winnings over time. This graph is updated in real time.

https://preview.redd.it/ozl67fhrbfug1.png?width=1354&format=png&auto=webp&s=b25898f763381b0493f373ef52fd91746b437700

We publish every model's raw score and exact luck-corrections right on the leaderboard.

https://benchmark.gtowizard.com/

What Can This Extend To?

AIVAT works in any spot where some player's strategy is fully known, so really any "vs solver" situation. For example, it was used in the human vs Pluribus matches.

What other applications do you think this technology has in poker?

u/tombos21 — 11 hours ago

Benchmarking Top LLMs at Poker

The world’s best LLMs are still terrible at poker.

We put each model into a 200bb heads-up NLHE match against GTO Wizard AI. The best one lost 16 bb/100.

For context, a strong human pro loses only about 4 bb/100.

https://preview.redd.it/f55wkon387ug1.png?width=4096&format=png&auto=webp&s=abe72fc351070195e5f8da5ccb831324a1b04d76

The price-performance chart is even more interesting. There's a clear Pareto curve. More compute helps, but only up to a point. You can't reason your way out of bad fundamentals.

Grok 4 is the funniest point on the graph: one of the most expensive, least useful poker models.

https://preview.redd.it/t5gxi3zc87ug1.png?width=4096&format=png&auto=webp&s=9baeaecf0b4e47f0b7bede9850ab8d2917a8195e

Luck-Adjustment

Each model's winrate was luck-adjusted using AIVAT, a powerful variance reduction technique that cuts the standard deviation by a factor of ~10. It has previously been used in Pluribus and other academic poker projects.
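Back-of-envelope on what a ~10x smaller standard deviation buys you at 5k hands (the sd figures below are illustrative, not the benchmark's actual numbers):

```python
import math

def winrate_stderr(sd_bb100: float, hands: int) -> float:
    """Standard error of a winrate in bb/100, given the sd per 100 hands."""
    return sd_bb100 / math.sqrt(hands / 100)

raw_sd, aivat_sd, hands = 100.0, 10.0, 5_000  # hypothetical sd values

print(f"raw:   ±{winrate_stderr(raw_sd, hands):.1f} bb/100")    # ±14.1
print(f"AIVAT: ±{winrate_stderr(aivat_sd, hands):.1f} bb/100")  # ±1.4
```

With error bars that tight, a 16 bb/100 loss over 5k hands is clearly distinguishable from noise.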

AIVAT works because we know GTO Wizard AI's full strategy (how it would play every hand in every spot), so we can get a much more accurate idea of each LLM's true EV.

https://preview.redd.it/av38ygs297ug1.png?width=1411&format=png&auto=webp&s=b024d8e945e114ceabfc16a9e10865c16b12bf82

Public Benchmark

Leaderboard: https://benchmark.gtowizard.com/

The benchmark is public, and you can see the live results here. I think it’s a pretty interesting way to evaluate LLMs in a domain that’s much harder to game or overfit to. Poker hasn’t really been “bench-maxxed” yet, so it feels closer to a model’s real underlying strength.

The API is public as well, so anyone can request access for free, run their own model, and see how it stacks up on the leaderboard.

Paper

For those interested in the details, we've published a paper on arXiv that covers the methodology and results in more depth.

https://arxiv.org/abs/2603.23660

u/tombos21 — 1 day ago