AIVAT: Statistically Significant Win Rates with 1/10th the Sample Size
Yesterday GTO Wizard published a benchmark pitting the best LLMs against GTO Wizard AI.
Tom Dwan responded:
>This is cool. 5k obviously not enough hands though, you guys should know that. Can you run a new one with 50-100k hands
This reveals one of the most interesting parts of the project: luck-adjusted winrates.
Let me explain.
Poker players are conditioned to think you need 100k+ hands for meaningful results, but that's not always true.
- If you know both players' complete strategies, you can calculate their exact expected winrates with zero variance (just like a solver).
- If you only know one player's complete strategy (GTO Wizard in this case), you can still drastically reduce the variance. That enables us to get statistically significant match results with a fraction of the sample size.
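To put rough numbers on the sample-size claim, here is a back-of-the-envelope sketch. The 10 bb/100 edge, the 100 bb/100 standard deviation, and the `hands_needed` helper are all made-up assumptions for illustration, not figures from the benchmark. The point is only that the required sample scales with variance, so cutting variance by 10x cuts the hands needed by about 10x.

```python
import math

def hands_needed(winrate_bb100, std_bb100, z=1.96):
    """Hands needed before a z-sigma confidence interval excludes zero."""
    # The standard error after n hands is std_bb100 / sqrt(n / 100),
    # so we solve z * std_bb100 / sqrt(n / 100) = winrate_bb100 for n.
    blocks_of_100 = (z * std_bb100 / winrate_bb100) ** 2
    return math.ceil(blocks_of_100 * 100)

# Hypothetical numbers: a 10 bb/100 edge and a 100 bb/100 standard deviation.
raw_hands = hands_needed(winrate_bb100=10, std_bb100=100)
# Same edge, but with the variance cut by 10x (std divided by sqrt(10)).
adjusted_hands = hands_needed(winrate_bb100=10, std_bb100=100 / math.sqrt(10))

print(raw_hands, adjusted_hands)
```

With these assumed inputs the raw requirement lands in the tens of thousands of hands and the adjusted one in the low thousands. The real reduction depends on how much variance the adjustment actually removes.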
How Does It Work?
You probably already understand variance reduction as a concept. For example, all-in adjusted winrates are a common way to reduce variance: since we know each player's equity at the moment the money goes in, we can score the hand by equity instead of by the runout. But AIVAT goes way beyond that. Knowing half the strategy pair is enough for massive variance reduction.
As an example, since we know GTO Wizard's entire range at showdown, instead of noisy hand vs hand showdowns, we can evaluate hand vs range. That obviously converges a lot faster. The short-term results stop being dominated by coolers and more quickly reflect your true EV.
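Here is a toy sketch of the hand vs range idea. Everything in it is an assumption for illustration: hands are reduced to a single strength number, the pot size is fixed, and `villain_range` is a made-up four-hand range. It is not how the benchmark evaluates real showdowns, it just shows that scoring against the whole known range gives the same average with less noise.

```python
import random
import statistics

random.seed(0)
POT = 10.0
# Hypothetical known showdown range for villain: hand strength -> probability.
villain_range = {0.2: 0.25, 0.5: 0.25, 0.8: 0.25, 0.95: 0.25}

def payoff(hero, villain):
    """Hero wins the pot with the stronger hand, loses it with the weaker."""
    if hero > villain:
        return POT
    if hero < villain:
        return -POT
    return 0.0

def hand_vs_hand(hero):
    """Raw result: villain shows up with one hand sampled from the range."""
    strengths = list(villain_range)
    weights = list(villain_range.values())
    villain = random.choices(strengths, weights=weights)[0]
    return payoff(hero, villain)

def hand_vs_range(hero):
    """Adjusted result: average the payoff over villain's entire range."""
    return sum(p * payoff(hero, v) for v, p in villain_range.items())

hero_hands = [random.random() for _ in range(20000)]
raw = [hand_vs_hand(h) for h in hero_hands]
adjusted = [hand_vs_range(h) for h in hero_hands]

print("mean:", statistics.mean(raw), statistics.mean(adjusted))
print("variance:", statistics.pvariance(raw), statistics.pvariance(adjusted))
```

The two means should agree up to sampling noise, while the hand vs range variance comes out clearly smaller. The variance that remains comes from which hand hero was dealt, which this sketch does not adjust for.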
But that’s only one piece of it. AIVAT applies several luck-adjustments that build on the fact that one player’s strategy is known. For example, it also accounts for card luck (how much the board helped or hurt the agent), as well as RNG luck (how lucky you were with respect to villain's mixed actions, e.g. maybe they rolled a low frequency fold to a massive bluff).
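The RNG luck correction can be sketched as a control variate. This is a toy in the spirit of the idea, not the actual AIVAT implementation: the 10% fold frequency, the payoffs, and the deliberately imperfect `VALUE_GUESS` numbers are all made up. The key property is that the luck term averages to zero under villain's known mix, so subtracting it does not bias the result even when the value guesses are off.

```python
import random
import statistics

random.seed(1)
# Hypothetical spot: hero bluffs, villain's known strategy folds 10% of the time.
FOLD_FREQ = 0.1
TRUE_PAYOFF = {"fold": 10.0, "call": -10.0}
# Deliberately imperfect value guesses for each villain action.
VALUE_GUESS = {"fold": 9.0, "call": -9.0}
# Expected value guess under villain's known mixed strategy.
EXPECTED_GUESS = (FOLD_FREQ * VALUE_GUESS["fold"]
                  + (1 - FOLD_FREQ) * VALUE_GUESS["call"])

def play_bluff():
    """Return (raw result, luck-adjusted result) for one bluff."""
    action = "fold" if random.random() < FOLD_FREQ else "call"
    raw = TRUE_PAYOFF[action]
    # Luck term: how much better the rolled action was than the average roll.
    luck = VALUE_GUESS[action] - EXPECTED_GUESS
    return raw, raw - luck

results = [play_bluff() for _ in range(20000)]
raw_results = [r for r, _ in results]
adjusted_results = [a for _, a in results]

print("mean:", statistics.mean(raw_results), statistics.mean(adjusted_results))
print("variance:", statistics.pvariance(raw_results),
      statistics.pvariance(adjusted_results))
```

Both means should sit near the true EV of the bluff, but the adjusted results barely move when villain happens to roll the rare fold. Card luck can be handled the same way by treating the deck as a "player" whose strategy (the deal probabilities) is known.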
Versions of this technique have previously been used in landmark poker AI projects like DeepStack and Pluribus. The details go beyond what I can outline in a Reddit post, but they are fully explained in the literature. You can read more about it here:
Here's a look at how closely the luck-adjusted winnings track the raw winnings over time. This graph is updated in real time.
We publish every model's raw score and exact luck-corrections right on the leaderboard.
https://benchmark.gtowizard.com/
What Can This Extend To?
AIVAT works in spots where at least one player's strategy is fully known, so really any "vs solver" situation. For example, it's been used in human vs Pluribus matches.
What other applications do you think this technology has in poker?