Safe subgame solving: the breakthrough in computer poker
TL;DR: A breakthrough paper in computer poker is poorly written. I try to explain it better. Intended readership: people who care about the technical details of obscure game-playing AI algorithms.
In early 2017, Carnegie Mellon University researchers created an AI, Libratus, that finally beat top humans at 1 vs 1 poker. This happened to comparatively little fanfare next to DeepMind’s successes in Go.
I think there are two reasons for this. First, poker wasn’t seen as an “AI holy grail” the way Go has been. Second, Libratus used what we could call “classical AI” techniques (an iterative algorithm called CFR, short for counterfactual regret minimization) instead of deep learning, thus missing the deep learning hype.
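CFR works by repeatedly sweeping the game tree and minimizing regret at each decision point; its core update is regret matching. As a taste of the algorithm’s flavor, here is a minimal sketch of a single regret-matching step (my own illustrative code, not Libratus’s):

```python
import numpy as np

def regret_matching(cumulative_regrets):
    """One regret-matching step: play each action with probability
    proportional to its positive cumulative regret (uniform if no
    action has positive regret)."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.ones_like(positive) / len(positive)

# Hypothetical regrets accumulated at one decision point with 3 actions:
regrets = np.array([4.0, -2.0, 1.0])
print(regret_matching(regrets))  # -> [0.8, 0.0, 0.2]
```

Running this update at every decision point, with regrets accumulated from self-play, makes the average strategy converge to a Nash equilibrium in two-player zero-sum games.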
That said, the main breakthrough behind Libratus can be generalized fairly easily. But first, we must talk a little about the theory and architecture behind AlphaGo [1] and modern poker bots.
Games like Go or poker are thought of in terms of their game tree. Go’s tree is straightforward: at each node you have one branch per possible stone placement. Poker’s tree is a bit more complex because of hidden information: situations the acting player cannot tell apart are grouped together into information sets. Both trees are huge – on the order of $10^{160}$ for 1 vs 1 Texas Hold’em and $10^{170}$ for Go.
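To make information sets concrete, here is a minimal sketch of how two poker states might be represented – the `Node` type and the key encoding are my own illustration, not any real bot’s format:

```python
from dataclasses import dataclass

@dataclass
class Node:
    player: int       # whose turn it is at this node
    actions: list     # legal actions (the branches)
    opp_hand: str     # hidden information: the opponent's cards
    infoset_key: str  # everything the acting player can observe

# Two distinct game states the acting player cannot tell apart: the
# opponent's hidden hand differs, but everything observable is the
# same, so both states land in the same information set and must be
# played with the same strategy.
node_a = Node(0, ["fold", "call", "raise"], opp_hand="QhQd",
              infoset_key="hand=AsKs|board=Qs7h2d|history=check,bet")
node_b = Node(0, ["fold", "call", "raise"], opp_hand="9c9s",
              infoset_key="hand=AsKs|board=Qs7h2d|history=check,bet")
assert node_a.infoset_key == node_b.infoset_key  # same information set
```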
The hard part of making bots for these games is the top of the tree; solving small subtrees at the bottom is easy. AlphaGo and classic chess bots solved this problem by copying the moves of top humans [2]. Poker bots haven’t found a similar method for the top of the tree. Instead, they use “abstraction” methods: lower the branching factor (reduce the number of allowed bets in the tree) and prune unhelpful parts of the tree as early as possible. [3]
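For example, an action abstraction might allow only a handful of bet sizes, expressed as fractions of the pot. A minimal sketch (the specific fractions are illustrative, not Libratus’s):

```python
def abstract_bets(pot, stack, fractions=(0.5, 1.0, 2.0)):
    """Restrict the continuous space of bet sizes to a few pot
    fractions plus all-in, collapsing the branching factor from
    thousands of sizes to a handful."""
    bets = {round(pot * f) for f in fractions if 0 < pot * f <= stack}
    bets.add(stack)  # the all-in bet is always kept
    return sorted(bets)

print(abstract_bets(pot=100, stack=1000))  # -> [50, 100, 200, 1000]
```

An opponent’s off-tree bet (say 137 into a pot of 100) then has to be mapped to the nearest abstract size, which is one of the places where the precomputed strategy leaks quality.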
This is bad because, if you use a single algorithm to solve the whole tree, the quality of the strategy degrades quickly as you go down the tree. The solution Libratus brings to this problem is to re-solve the subtrees we actually reach at runtime, producing stronger strategies for them.
Re-solving subgames you reach at runtime won’t work with the naive method in a game of imperfect information, because you have to carry assumptions about the hidden information with you from the “top of the tree” strategy. Naive re-solving simply overfits your resulting strategy to those (very likely incorrect) assumptions about the opponent’s behavior, and can, in theory, make your strategy worse.
The solution to this problem is quite elegant [4]: you re-solve the subgame in the most adversarial environment possible. Say we are re-solving a subgame whose root node is our action. We first add a new root node above the original subgame root, and this new root is an opponent’s action: in it, the opponent gets to choose which private poker hand he enters the subgame with [5].
Then, for each private hand of the opposing player, we compute the value of the option to “opt out” of the subgame, which should net him the same payoff as playing on in the tree above the subgame. We subtract those values from his payoffs if he chooses to enter the subgame with that particular hand. This turns out to be equivalent to solving the minimax problem for the subgame while taking into account the opponent’s payoffs from one level above, and it is “safe” in that it can only improve a given strategy for a subgame.
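Here is a minimal sketch of that construction on a toy subgame of my own; the function, the hands, and every number are hypothetical, chosen only to show the mechanics:

```python
def make_safe_subgame(hands, blueprint_values, subgame_payoffs):
    """Build the extra root described above. For each private hand the
    opponent may hold, he chooses between:
      * opt_out: take the value the "top of the tree" strategy already
        guarantees him here (normalized to 0), or
      * enter:   play the subgame, with that guaranteed value
        subtracted from every payoff.
    Entering is only attractive for hands against which our re-solved
    strategy would do worse than the original one, which is exactly
    what the re-solve then fixes; hence the safety guarantee.
    """
    gadget = {}
    for hand in hands:
        alt = blueprint_values[hand]  # opponent's guaranteed value here
        gadget[hand] = {
            "opt_out": 0.0,
            "enter": {action: payoff - alt
                      for action, payoff in subgame_payoffs[hand].items()},
        }
    return gadget

# Toy numbers: two possible opponent hands, one decision in the subgame.
hands = ["AA", "72o"]
blueprint_values = {"AA": 5.0, "72o": -1.0}
subgame_payoffs = {"AA":  {"bet": 6.0,  "check": 4.0},
                   "72o": {"bet": -2.0, "check": -1.5}}
print(make_safe_subgame(hands, blueprint_values, subgame_payoffs))
# "AA" still gains by entering (6 - 5 > 0), so our re-solved strategy
# must defend against it; "72o" prefers to opt out, since all its
# subgame payoffs are now negative.
```

Solving the minimax of this gadget game (e.g. with CFR again) yields the re-solved strategy for the subgame.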
The great thing about this trick is that it applies generally to imperfect-information games – so I would expect to see it used in AlphaZero-style approaches to imperfect-information games in the future.
The original 2016 AlphaGo. We’ll talk about the AlphaZero algorithm another day. ↩︎
With opening books for chess bots and supervised learning on human expert games for AlphaGo. ↩︎
One notable exception to this has been the DeepStack bot, which works similarly to AlphaGo, using supervised learning at the top of the tree. ↩︎
If you can discern it in this sadly very poorly written paper. ↩︎
Formally, the opponent chooses the particular information set the subgame starts at. He should converge to a mixed strategy over his undominated information sets. ↩︎