Kyle Hughart
Computer Science
University of Wisconsin Platteville
hughartk@uwplatt.edu
Abstract
Many challenging computer opponents have been developed for finite-state, two-player board games using techniques such as minimax trees, alpha-beta pruning, and various heuristics. These tactics have enjoyed success in games such as Tic-Tac-Toe, checkers, and chess. Some games, however, allow such enormous numbers of possible play-throughs that they are computationally infeasible to analyze with these methods. Monte Carlo tree search is a method wherein the computer opponent estimates the best moves based on random or partially random game simulations that make up only a small portion of all possible games.
Introduction
Many two-player, finite-state board games such as Tic-Tac-Toe, Chess, and Checkers can be represented in the form of a tree, with the root at the beginning board state. Games with very large numbers of possible play-throughs are very “deep” games and have very large trees. Tic-Tac-Toe has about 10^5 possible games. Checkers has 10^50, and Chess, 10^123. Go, an ancient East Asian board game with simple rules but vast possibilities, has 10^360. Games like Tic-Tac-Toe can easily be analyzed in their entirety and optimal win strategies can be formulated, but this is not computationally feasible for games like Go [5].
Conventional methods
AI for simple games like Tic-Tac-Toe can be created using methods such as a minimax tree, which is named after the presence of a “maximizing player” and a “minimizing player”. A minimax search first generates a tree of all possible game sequences (with the root being a blank board). The leaves of the tree are then given a value. For shallow games like Tic-Tac-Toe, the leaves will typically all be end-game states, assigned a win or loss value. For deeper games, a heuristic evaluation function may be used instead. Then, each node in the next level up is assigned the value of either its maximum or its minimum child, depending on whether it represents the maximizing or the minimizing player's move. This method assumes that each player will take the move most advantageous to them.
Figure 1: Minimax Tree (Trinity College 2011)
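To make the procedure concrete, the following is a minimal sketch of minimax in Python. It assumes a hypothetical game-state object exposing is_terminal(), evaluate(), legal_moves(), and apply(move); these names are illustrative, not taken from any particular library.

```python
def minimax(state, depth, maximizing):
    """Return the minimax value of `state`, looking `depth` plies ahead.

    The game-state interface (is_terminal, evaluate, legal_moves, apply)
    is assumed for illustration.
    """
    if depth == 0 or state.is_terminal():
        return state.evaluate()  # win/loss value, or a heuristic score for deep games
    child_values = (minimax(state.apply(move), depth - 1, not maximizing)
                    for move in state.legal_moves())
    # The maximizing player takes the best child; the minimizing player the worst.
    return max(child_values) if maximizing else min(child_values)
```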
For deeper games like Go, a minimax tree can still be used, and alpha-beta pruning may be applied to “prune off” branches that clearly cannot beat solutions already discovered, rather than evaluating them fully [6]. However, even with alpha-beta pruning, game trees like those generated by Go are far too deep for competitive computer opponents to be created using these methods.
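Alpha-beta pruning itself is simple to express. The sketch below, using the same assumed game-state interface as the previous example, shows how whole subtrees are skipped once it is clear they cannot influence the final choice.

```python
def alphabeta(state, depth, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax with alpha-beta pruning; interface names are illustrative."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()
    if maximizing:
        value = float("-inf")
        for move in state.legal_moves():
            value = max(value, alphabeta(state.apply(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:   # this branch cannot beat a line already found elsewhere
                break           # prune the remaining children
        return value
    value = float("inf")
    for move in state.legal_moves():
        value = min(value, alphabeta(state.apply(move), depth - 1, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value
```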
What is the Monte Carlo Method?
To solve the problem above, let's first discuss the Monte Carlo method, and then show how it applies to tree searches. Broadly, the Monte Carlo method could be considered the practice of estimating a solution based on a number of random trials. For example, one might estimate the area of an irregular figure by inscribing it in a square and randomly distributing points inside the square. The fraction of points that fall inside the figure approximates the ratio of the figure's area to the square's area.
Figure 2: Area Estimation with Monte Carlo Method (Hiob 1996)
In Figure 2, we scatter 1000 points inside the square, 280 of which fall inside the shape. We can thus conclude that the area of the shape is approximately 28% of the area of the square, which itself can be easily calculated. Now imagine we were playing a game where we saw 10 squares like this, all with different irregular shapes, and we wanted to choose the one with the biggest irregular shape. We could use the Monte Carlo method to make a good guess about the right choice; the more dots we use, the better the estimate.
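As a runnable illustration of the same idea, the sketch below estimates the area of a unit circle inscribed in a 2x2 square; the circle simply stands in for the irregular figure.

```python
import random

def estimate_area(inside, square_area=4.0, samples=1000):
    """Estimate a figure's area from a membership test `inside(x, y)` on the square [-1, 1]^2."""
    hits = sum(inside(random.uniform(-1, 1), random.uniform(-1, 1))
               for _ in range(samples))
    return square_area * hits / samples  # fraction of hits times the square's area

# Example: a unit circle; the estimate approaches pi (~3.14) as `samples` grows.
print(estimate_area(lambda x, y: x * x + y * y <= 1.0, samples=100000))
```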
Given a game with a finite number of possible moves from turn to turn (such as chess or checkers), a “vanilla” implementation of the Monte Carlo method would be to simulate a large number of purely random games before making a move. These simulations would have to understand what legal moves were available each turn, and would need to know when a win or lose condition was reached, but would require no knowledge of good tactics or advantageous positions. Each play-through would be similar to a “dot” in Figure 2, and the ones that fall in the shape would be winning choices. Then, each available opening move would be ranked in terms of how frequently it won in the simulations. New simulations could be carried out after each successive move.
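A minimal sketch of this vanilla approach, reusing the illustrative game-state interface from earlier (plus an assumed winner() method):

```python
import random

def random_playout(state, player):
    """Play random legal moves to the end of the game; return 1 if `player` wins, else 0."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return 1 if state.winner() == player else 0

def vanilla_monte_carlo_move(state, player, playouts_per_move=200):
    """Rank each available opening move by its win frequency over random playouts."""
    def win_rate(move):
        child = state.apply(move)
        wins = sum(random_playout(child, player) for _ in range(playouts_per_move))
        return wins / playouts_per_move
    return max(state.legal_moves(), key=win_rate)
```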
Although the above technique yields somewhat strategic play despite having no direct knowledge of strategy, it is not typically used. It fails to take into account even the most basic aspects of strategy, such as the fact that the opponent will strive to bring about wins. It will make the wrong move in situations such as this one:
Figure 3: Random Rollout Weakness (Browne 2011)
In the above game, victory is achieved by getting five pieces of one's color in a row. It is currently black's turn to place a piece. It is clear that black must place a piece at position a to avoid losing, but a vanilla Monte Carlo method will choose b. This is because choosing b opens up both positions labeled x as winning positions for black on the next turn. Statistically, there are more winning game states on the path of move b, but in circumstances like this one, that clearly does not indicate the best move.
Modified rollout
Perhaps the most obvious solution to this problem is to modify the “rollout” or simulation. The rollout for a vanilla Monte Carlo approach makes purely random legal moves, but we can modify this to involve heuristics or rules. These can be domain specific, or they can simply involve checking for immediate winning or losing moves, and “pruning” those moves from the decision tree to ensure they are not taken. A similar method would be adding a short minimax tree (a depth of two would be sufficient) to the end of the simulation.
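One possible form of such a modified rollout, sketched with the same assumed interface; the apply_as() helper, which plays a move as a given player, is hypothetical:

```python
import random

def guided_rollout_move(state, player, opponent):
    """Choose a rollout move using two cheap rules before falling back to random play."""
    moves = state.legal_moves()
    # Rule 1: take an immediately winning move if one exists.
    for move in moves:
        if state.apply(move).winner() == player:
            return move
    # Rule 2: if the opponent could win by playing some move, play there first to block it.
    for move in moves:
        if state.apply_as(opponent, move).winner() == opponent:  # hypothetical helper
            return move
    # Otherwise, fall back to a purely random legal move.
    return random.choice(moves)
```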
Exploration vs. Exploitation
Another common improvement is to build a tree from the starting position, which is used to avoid less useful simulations. The mechanism for deciding which simulations to run is governed by what is known as the exploration vs. exploitation tradeoff, the phenomenon demonstrated by “k-armed bandit” problems.
Imagine a slot machine with multiple levers (a “bandit” with multiple “arms”) where each lever has its own set of possible rewards and associated probabilities of winning them. For example, one lever might pay only one dollar, and only pay 1 in 10 times. Another might pay anywhere from 1-10 dollars with equal likelihood. Another might pay 10 dollars 60% of the time and 3 dollars the rest of the time. The gambler starts with no initial knowledge of the possible payouts, and thus faces an exploration vs. exploitation tradeoff with each successive pull. Imagine one lever has been pulled 10 times and has been paying off nicely most of the time. Another lever has been pulled only once and it didn't do very well at all. A third lever has not even been tried yet. The gambler must weigh the unknown rewards of pulling the untried lever against giving more tries to the once-pulled lever or sticking with the reliable one. That is, they have to weigh “exploiting” the good lever against “exploring” the uncertain ones [4].
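The three levers in that example can be simulated directly; the sketch below encodes their payout distributions. The gambler's problem is that these distributions can only be learned by pulling.

```python
import random

def pull(lever):
    """Simulate one pull of the three example levers described above."""
    if lever == 0:
        return 1.0 if random.random() < 0.1 else 0.0   # $1, but only 1 time in 10
    if lever == 1:
        return float(random.randint(1, 10))            # $1-$10, all equally likely
    return 10.0 if random.random() < 0.6 else 3.0      # $10 60% of the time, else $3
```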
UCT
A number of strategies have been proposed to balance exploration vs. exploitation in a way that maximizes expected payout. One method is to use an Upper Confidence Bound (UCB) formula, which can be shown as follows:
Figure 4: Upper Confidence Bound formula (Chaslot 2006)
When applied to deciding which nodes in a decision tree to simulate, we interpret the formula as follows: c_{t,s} is a bias term, typically of the form C * sqrt(ln t / s), that is added to the usual ranking of a node. Recall that the ranking is established by the node's likelihood of leading to a winning game based on previous simulations. s is the number of times the node has been visited, t is the total number of times the node's parent has been visited, and C is a “bias parameter” that can be adjusted to best fit the application.
The value of “exploitation” is captured by the ranking, which shows how much luck we've had trying this node in the past, while the value of “exploration” is captured by the term c_{t,s}. We cannot visit a node without also visiting its parent (excluding the node we are currently deciding from), so t will always be greater than or equal to s. As we visit a parent node more and more times without visiting one of its particular children, the value of exploring that child node will grow, but it will grow more and more slowly as the number of visits to both parent and child increase, due to the logarithm in the numerator. Thus, the peak of our interest in exploring new nodes will happen early, and the peak of our interest in exploiting known good nodes will happen late, which is consistent with our intuition [4].
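A small sketch of this selection rule in Python, assuming tree nodes that carry wins and visits counts; the bias parameter value of 1.4 is an arbitrary illustration.

```python
import math

def uct_score(wins, visits, parent_visits, c=1.4):
    """Exploitation (win rate) plus the exploration bias c_{t,s} = c * sqrt(ln t / s)."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried at least once
    exploitation = wins / visits                                    # the node's ranking
    exploration = c * math.sqrt(math.log(parent_visits) / visits)   # grows when neglected
    return exploitation + exploration

def select_child(parent):
    """Pick the child of `parent` with the highest UCT score (node fields are assumed)."""
    return max(parent.children,
               key=lambda child: uct_score(child.wins, child.visits, parent.visits))
```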
The use of this formula in Monte Carlo tree search is called the UCT method, and was introduced by Levente Kocsis and Csaba Szepesvari in 2006. UCT can be broken into Selection, Expansion, Simulation, and Backpropagation phases.
Figure 5: Phases of UCT (Chaslot 2006)
When simulation first begins at the root of the game tree, the only known node is the root, so it is “selected”, and we move right away into expansion. For this step, we generate all possible immediate children of the current node (the root, in this case), and we select one at random. From that node, we run our simulation, also known as a “rollout”.
How far the rollout goes varies by implementation. In very deep trees, we may want to set a fixed depth, or start with a fixed depth and increase it with each pass. In board games, we may want to continue running the simulation until an end game is reached. In games that keep a running score (such as Scrabble), a simulation that terminates before the end of the game can simply return the current score for the ranking calculation. In games like Chess, where there are only win and lose conditions instead of a progressive score, terminating early requires an “evaluation function” that uses domain-specific heuristics to assess the desirability of the board state. After the simulation concludes, we use the results to update the ranking of the nodes we selected to reach this point (backpropagation).
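A hedged sketch of such a depth-limited rollout; winner() and evaluate_for() are illustrative stand-ins for a win/lose result and a heuristic evaluation function respectively.

```python
import random

def depth_limited_rollout(state, player, max_depth=20):
    """Rollout that stops after `max_depth` random moves and falls back to a heuristic."""
    depth = 0
    while not state.is_terminal() and depth < max_depth:
        state = state.apply(random.choice(state.legal_moves()))
        depth += 1
    if state.is_terminal():
        return 1.0 if state.winner() == player else 0.0   # true win/lose result
    # For score-keeping games this could return the current score instead;
    # for win/lose games an evaluation function estimates board desirability.
    return state.evaluate_for(player)                      # assumed heuristic method
```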
The second simulation plays out the same way, but this time we use the UCB formula to decide whether to try the same child node again or try one of the others. We then “expand” again from whichever node we select. This process is repeated over and over, for as much time as the implementation allows for making a move, until we have developed an asymmetric tree of game states whose knowledge is concentrated on good tactics [3]. This is in contrast to a vanilla Monte Carlo approach, which has equal knowledge of games resulting from both good tactics and bad tactics. UCT will choose the correct move in the 5-in-a-row example shown earlier, even if rollouts are completely random [2].
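Putting the four phases together, the following is a compact sketch of the whole UCT loop under the same assumed game-state interface; it is an illustration of the general method, not any particular program's implementation.

```python
import math
import random

class Node:
    """One game state in the search tree, with win/visit statistics."""
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = [] if state.is_terminal() else list(state.legal_moves())
        self.wins, self.visits = 0.0, 0

    def best_child(self, c=1.4):
        # Selection: UCB1 score balances exploitation and exploration.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def uct_search(root_state, player, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCB1.
        while not node.untried and node.children:
            node = node.best_child()
        # 2. Expansion: add one previously untried child, chosen at random.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.apply(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation (rollout): play random legal moves to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_moves()))
        reward = 1.0 if state.winner() == player else 0.0  # scored from `player`'s view
        # 4. Backpropagation: update statistics along the path that was selected.
        while node is not None:
            node.visits += 1
            node.wins += reward
            node = node.parent
    # After the time budget is spent, play the most-visited move from the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

A full implementation would alternate the reward's perspective as the players alternate turns; this sketch keeps a single player's perspective for simplicity.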
MoGo
“MoGo” is a top-level Go player on a 9x9 board. It was the first AI Go player to utilize the UCT method. The pseudo-code for a simulation in MoGo is as follows:
Figure 6: MoGo Pseudo-Code (Gelly 2006)
The process begins at the root (node[0]) on line 2. On line 4 we see the function call to “descendByUCB1”, where a single “select” of the selection phase takes place; selection continues until we reach a node that has not yet been selected. The expansion step, where we create a new node, occurs at line 6. On lines 7 and 8, the value is calculated and backpropagation begins. The “playOnceSequenceInMoGo” function would be executed repeatedly to build a Monte Carlo tree.
MoGo does not tally a score, but simply treats winning results as 1's and losing results as 0's. Draws are ignored, as they are extremely rare in Go. This ranking system and the lack of an evaluation function require that simulations are carried out until the end of a game, rather than terminating at a given depth. Since UCT improves greatly as computation time increases, MoGo's developers created a parallelized implementation capable of running on a multi-processor machine with shared memory.
Figure 7: Improvement of UCT with Computing Time
Although MoGo’s developers found that using a modified rollout (non-random selection of moves during simulation) greatly improves the AI’s skill, the above statistics were taken using purely random rollouts in order to demonstrate the advantage a parallelized approach has for UCT [7].
Conclusions
Monte Carlo tree search is currently the dominant class of board-game AI for deep games like Go, and it can be extended in a variety of ways, both domain-specific and domain-independent. It can be adapted to games that do or do not have a persisting score from state to state (Scrabble vs. Chess), and conventional methods such as a shallow minimax search can be added to it in order to improve its play. It enjoys substantial improvements with increased computing power and parallelization, which gives it a promising future as computing technology improves and the utilization of multi-core processors becomes more common. Its compactness, elegance, and reliability even in the absence of domain-specific knowledge make it a highly adaptable solution for a wide variety of decision problems that occupy a large state space.
References
[1] Hiob, E. (1996, December 7). BCIT Math Square :: Finding Areas Using Monte Carlo. BCIT Commons. Retrieved March 24, 2012, from http://commons.bcit.ca/math/entertainment/monte/index.html
[2] Browne, C. (2011). On the Dangers of Random Playouts. ICGA Journal, 34(1), 25-26. Retrieved March 22, 2012, from http://pubs.doc.ic.ac.uk/dangers-random-playouts/dangers-random-playouts.pdf
[3] Chaslot, Guillaume et al. (October, 2008). Monte-Carlo Tree Search: A New Framework for Game AI. Unpublished paper presented at the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference, Maastricht, The Netherlands.
[4] Kocsis, L. & Szepesvari, C. (September 2006). Bandit based Monte-Carlo Planning. Unpublished paper presented at the European Conference on Machine Learning, Berlin, Germany.
[5] Mertens, P.J.C. "A Quoridor-playing Agent." Maastricht University, 21 June 2006. Web. 27 Mar. 2012.
[6] "Notes: Minimax & Alpha/Beta Pruning." Trinity College, 18 Oct. 2011. Web. 27 Mar. 2012.
[7] Gelly, Sylvain et al. (December, 2006). Exploration exploitation in Go: UCT for Monte-Carlo Go. Unpublished paper presented at the On-line Trading of Exploration and Exploitation, Whistler, BC, Canada.