
AIVAT: Statistically Significant Win Rates with 1/10th the Sample Size

Yesterday GTO Wizard published a benchmark pitting the best LLMs against GTO Wizard AI.

Tom Dwan responded:

>This is cool. 5k obviously not enough hands though, you guys should know that. Can you run a new one with 50-100k hands

This reveals one of the most interesting parts of the project: luck-adjusted winrates.

Let me explain.

Poker players are conditioned to think you need 100k+ hands for meaningful results, but that's not always true.

  • If you know both players' complete strategies, you can calculate their winrates with zero variance (just like a solver)
  • If you only know one player's complete strategy (GTO Wizard in this case), you can still drastically reduce the variance. That enables us to get statistically significant match results with a fraction of the sample size (see the toy sketch below).
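Here's a toy sketch of that first bullet in Python (all numbers invented, nothing to do with the actual benchmark). With both strategies known, EV is an exact expectation; without them, you're stuck with noisy Monte Carlo sampling:

```python
import random

# Hypothetical terminal outcomes of a tiny toy game: (probability, payoff in bb)
outcomes = [(0.5, +2.0), (0.3, -1.0), (0.2, -3.5)]

# Both strategies known -> EV is an exact expectation, zero variance
exact_ev = sum(p * v for p, v in outcomes)

# Neither strategy known -> estimate EV by dealing hands, which is noisy
def sampled_ev(n_hands: int) -> float:
    probs, payoffs = zip(*outcomes)
    results = random.choices(payoffs, weights=probs, k=n_hands)
    return sum(results) / n_hands

print(f"exact EV:         {exact_ev:+.3f} bb/hand")  # deterministic
print(f"5k-hand estimate: {sampled_ev(5000):+.3f} bb/hand")  # varies run to run
```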

How Does It Work?

You probably already understand variance reduction as a concept. For example, all-in adjusted winrates are a common way to reduce variance, since we know each player's equity at the moment they went all in. But AIVAT goes way beyond that. Knowing half the strategy pair is enough for massive variance reduction.
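A minimal sketch of that all-in adjustment, assuming a simple two-player all-in with no side pots (the function and numbers are mine, not GTO Wizard's):

```python
# All-in EV adjustment (not AIVAT itself): when the money goes in, credit each
# player their equity share of the pot instead of the actual runout.

def all_in_adjusted_result(pot: float, invested: float, equity: float) -> float:
    """Hero's luck-adjusted winnings for a single all-in pot (in bb).

    pot      -- total pot size once all-in
    invested -- hero's contribution to that pot
    equity   -- hero's probability of winning at the moment of all-in
    """
    return equity * pot - invested

# Example: 100bb pot, hero invested 50bb with 80% equity.
# The raw result is +50 or -50 depending on the runout; the adjusted result
# is +30 either way, so suckouts and coolers stop adding noise.
print(all_in_adjusted_result(pot=100, invested=50, equity=0.80))  # 30.0
```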

For example, since we know GTO Wizard's entire range at showdown, we can evaluate hand vs range instead of noisy hand vs hand showdowns. That converges much faster: short-term results stop being dominated by coolers and reflect your true EV much sooner.
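Roughly what hand-vs-range scoring looks like, with a made-up range and equities:

```python
# Hypothetical showdown: villain's full range is known, so hero's winnings are
# averaged over every combo (weighted by reach probability) instead of scored
# against the single combo villain happened to hold. All numbers invented.

# (combo, reach probability, hero's equity vs that combo)
villain_range = [
    ("AhAd", 0.10, 0.08),
    ("KsQs", 0.55, 0.62),
    ("7c6c", 0.35, 0.91),
]

def hand_vs_range_winnings(pot: float, combos) -> float:
    """Hero's expected share of the pot against villain's whole range."""
    total = sum(w for _, w, _ in combos)
    return pot * sum(w * eq for _, w, eq in combos) / total

print(f"{hand_vs_range_winnings(100.0, villain_range):.1f} bb")  # 66.8 bb
```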

But that's only one piece of it. AIVAT applies several luck adjustments that build on the fact that one player's strategy is known. For example, it also accounts for card luck (how much the board helped or hurt the agent), as well as RNG luck (how lucky you were with respect to villain's mixed actions, e.g. maybe they rolled a low-frequency fold to a massive bluff).
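In control-variate terms it looks something like this. This is a loose paraphrase of the idea, not the exact estimator from the paper: each correction term is "observed minus expected" at a chance event or at one of the known player's mixed decisions, and because each term has expectation zero, subtracting them doesn't bias the result.

```python
# Rough control-variate sketch of the AIVAT idea (simplified; the real
# estimator is defined in the AIVAT paper). All numbers are hypothetical.

def aivat_style_adjust(raw_result: float,
                       corrections: list[tuple[float, float]]) -> float:
    """raw_result  -- hero's actual winnings for the hand (bb)
    corrections -- (observed_value, expected_value) pairs, one per chance
                   card dealt or per mixed action by the known-strategy player
    """
    luck = sum(observed - expected for observed, expected in corrections)
    return raw_result - luck

# Hypothetical hand: hero won 40bb, but the river card was worth +25bb of
# card luck and villain's low-frequency fold was worth +10bb of RNG luck.
adjusted = aivat_style_adjust(40.0, [(30.0, 5.0), (15.0, 5.0)])
print(adjusted)  # 5.0 -> hero's skill-based result was much smaller
```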

Versions of this technique have previously been used in landmark poker AI projects like DeepStack and Pluribus. The details go beyond what I can outline in a Reddit post, but they are fully explained in the literature. You can read more in the original AIVAT paper (Burch et al., 2018): https://arxiv.org/abs/1612.06915

Here's a look at how closely the luck-adjusted winnings track the raw winnings over time. This graph is updated in real time.

https://preview.redd.it/ozl67fhrbfug1.png?width=1354&format=png&auto=webp&s=b25898f763381b0493f373ef52fd91746b437700

We publish every model's raw score and exact luck-corrections right on the leaderboard.

https://benchmark.gtowizard.com/

What Can This Extend To?

AIVAT works in any spot where some player's strategy is fully known, so really any "vs solver" situation. For example, it was used in the human vs Pluribus matches.

What other applications do you think this technology has in poker?

u/tombos21 — 11 hours ago

Benchmarking Top LLMs at Poker

The world’s best LLMs are still terrible at poker.

We put each model into a 200bb heads-up NLHE match against GTO Wizard AI. The best one lost 16 bb/100.

For context, a strong human pro loses only about 4 bb/100.

https://preview.redd.it/f55wkon387ug1.png?width=4096&format=png&auto=webp&s=abe72fc351070195e5f8da5ccb831324a1b04d76

The price-performance chart is even more interesting. There's a clear Pareto curve. More compute helps, but only up to a point. You can't reason your way out of bad fundamentals.

Grok 4 is the funniest point on the graph: one of the most expensive, least useful poker models.

https://preview.redd.it/t5gxi3zc87ug1.png?width=4096&format=png&auto=webp&s=9baeaecf0b4e47f0b7bede9850ab8d2917a8195e

Luck-Adjustment

Each model's winrate was luck-adjusted using AIVAT, a powerful variance reduction technique that cuts the standard deviation by a factor of ~10. It has previously been used in Pluribus and other academic poker projects.
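Back-of-envelope on what a ~10x smaller standard deviation buys you at 5k hands (the sd figures below are illustrative, not the benchmark's actual numbers):

```python
import math

def winrate_stderr(sd_bb100: float, hands: int) -> float:
    """Standard error of a winrate in bb/100, given the sd per 100 hands."""
    return sd_bb100 / math.sqrt(hands / 100)

raw_sd, aivat_sd, hands = 100.0, 10.0, 5_000  # hypothetical sd values

print(f"raw:   ±{winrate_stderr(raw_sd, hands):.1f} bb/100")    # ±14.1
print(f"AIVAT: ±{winrate_stderr(aivat_sd, hands):.1f} bb/100")  # ±1.4
```

With error bars that tight, a 16 bb/100 loss over 5k hands is clearly distinguishable from noise.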

AIVAT works because we know GTO Wizard AI's full strategy (how it would play every hand in every spot), so we can get a much more accurate idea of each LLM's true EV.

https://preview.redd.it/av38ygs297ug1.png?width=1411&format=png&auto=webp&s=b024d8e945e114ceabfc16a9e10865c16b12bf82

Public Benchmark

Leaderboard: https://benchmark.gtowizard.com/

The benchmark is public, and you can see the live results here. I think it’s a pretty interesting way to evaluate LLMs in a domain that’s much harder to game or overfit to. Poker hasn’t really been “bench-maxxed” yet, so it feels closer to a model’s real underlying strength.

The API is public as well, so anyone can request access for free, run their own model, and see how it stacks up on the leaderboard.

Paper

For those interested in the details, we've published a paper on arXiv that covers the methodology and results in more depth.

https://arxiv.org/abs/2603.23660

u/tombos21 — 1 day ago