Skip to content

Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a search technique in the field of Artificial Intelligence (AI). It is a probabilistic and heuristic driven search algorithm that combines the classic tree search implementations alongside machine learning principles of reinforcement learning. In tree search, there’s always the possibility that the current best action is actually not the most optimal action. In such cases, MCTS algorithm becomes useful as it continues to evaluate other alternatives periodically during the learning phase by executing them, instead of the current perceived optimal strategy. This is known as the ”exploration-exploitation trade-off“. It exploits the actions and strategies that is found to be the best till now but also must continue to explore the local space of alternative decisions and find out if they could replace the current best.

Exploration helps in exploring and discovering the unexplored parts of the tree, which could result in finding a more optimal path. In other words, we can say that exploration expands the tree’s breadth more than its depth. Exploration can be useful to ensure that MCTS is not overlooking any potentially better paths. But it quickly becomes inefficient in situations with large number of steps or repetitions. In order to avoid that, it is balanced out by exploitation. Exploitation sticks to a single path that has the greatest estimated value. This is a greedy approach and this will extend the tree’s depth more than its breadth. In simple words, UCB formula applied to trees helps to balance the exploration-exploitation trade-off by periodically exploring relatively unexplored nodes of the tree and discovering potentially more optimal paths than the one it is currently exploiting. For this characteristic, MCTS becomes particularly useful in making optimal decisions in Artificial Intelligence (AI) problems.

Algorithm

At its core, MCTS consists of repeated iterations (ideally infinite, in practice constrained by computing time and resources) of 4 steps: selection, expansion, simulation and backup.

Selection: In this process, the MCTS algorithm traverses the current tree from the root node using a specific strategy. The strategy uses an evaluation function to optimally select nodes with the highest estimated value. MCTS uses the Upper Confidence Bound (UCB) formula applied to trees (UCT) as the strategy in the selection process to traverse the tree. It balances the exploration-exploitation trade-off. During tree traversal, a node is selected based on some parameters that return the maximum value. The parameters are characterized by the formula that is typically used for this purpose is given below.

\[UCB(node_i) = \frac{w_i}{n_i} + c \cdot \sqrt{\frac{ln(N)}{n_i}}\]

where:

  • \(w_i\) : cumulated node's value (reward)
  • \(n_i\) : number of time the node has been visited
  • \(c\) : confidence value that controls the level of exploration
  • \(N\) : total number of times that the node's parent has been visited

The formula for UCB can be thought of as being formed from 2 distinct parts:

Exploitation

\(\frac{w_i}{n_i}\) represents the exploitation part of the equation. UCB is based on the principle of “optimism in the fact of uncertainty”, which basically means if you don’t know which action is best then choose the one that currently looks to be the best. Taking this half of the equation by itself will do exactly that: the action that currently has the highest estimated reward will be the chosen action.

Exploration

The second half of the equation adds exploration, with the degree of exploration being controlled by the hyper-parameter \(c\). Effectively this part of the equation provides a measure of the uncertainty for the action’s reward estimate.

If an action hasn’t been tried very often, or not at all, then \(n_i\) will be small. Consequently the uncertainty term will be large, making this action more likely to be selected. Every time an action is taken we become more confident about its estimate. In this case \(n_i\) increments, and so the uncertainty term decreases, making it less likely that this action will be selected as a result of exploration (although it may still be selected as the action with the highest value, due to the exploitation term).

When an action is not being selected, the uncertainty term will grow slowly, due to the log function in the numerator. Whereas, every time that the action is selected, the uncertainty will shrink rapidly due to the increase in \(n_i\) being linear. So the exploration term will be larger for actions that have been selected infrequently, due to the uncertainty in the estimates of their rewards. As time progresses the exploration term gradually decreases (since as \(n_i\) goes to infinity \(\frac{ln(N)}{n_i}\) goes to zero), until eventually actions are selected based only on the exploitation term.

When traversing a tree during the selection process, the child node that returns the greatest value from the above equation will be one that will get selected. During traversal, once a child node is found which is also a leaf node, the MCTS jumps into the expansion step.

Note

Exploitation vs Exploration

The UCB formula balances the exploitation of known rewards with the exploration of relatively unvisited nodes to encourage their exercise. Reward estimates are based on random simulations, so nodes must be visited a number of times before these estimates become reliable; MCTS estimates will typically be unreliable at the start of a search but converge to more reliable estimates given sufficient time and perfect estimates given infinite time.

Expansion: In this process, a new child node is added to the tree to that node which was optimally reached during the selection process.

Simulation: In this process, a simulation is performed by choosing moves or strategies until a result or predefined state is achieved.

Backpropagation: After determining the value of the newly added node, the remaining tree must be updated. So, the backpropagation process is performed, where it backpropagates from the new node to the root node. During the process, the number of simulation stored in each node is incremented. Also, if the new node’s simulation results in a win, then the number of wins is also incremented.

Benefits

MCTS offers a number of advantages over traditional tree search methods.

Aheuristic MCTS does not require any strategic or tactical knowledge about the given domain to make reasonable decisions. The algorithm can function effectively with no knowledge of a game apart from its legal moves and end conditions; this means that a single MCTS implementation can be reused for a number of games with little modification, and makes MCTS a potential boon for general game playing.

Asymmetric MCTS performs asymmetric tree growth that adapts to the topology of the search space. The algorithm visits more interesting nodes more often, and focusses its search time in more relevant parts of the tree.

This makes MCTS suitable for games with large branching factors such as 19x19 Go. Such large combinatorial spaces typically cause problems for standard depth- or breadth-based search methods, but the adaptive nature of MCTS means that it will (eventually) find those moves that appear optimal and focus its search effort there.

Anytime The algorithm can be halted at any time to return the current best estimate. The search tree built thus far may be discarded or preserved for future reuse.