MCTS Based on Simple Regret

Authors

  • David Tolpin, Ben-Gurion University of the Negev
  • Solomon Shimony, Ben-Gurion University of the Negev

DOI:

https://doi.org/10.1609/aaai.v26i1.8126

Keywords:

search, MCTS, UCT, metareasoning, VOI

Abstract

UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
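
The two-stage SR+CR scheme described in the abstract can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes an ε-greedy simple-regret policy at the root (with ε = 0.5) and plain UCB1 below the root, operating on a pre-expanded tree; the names Node, rollout, SR_EPSILON, UCB_C, and best_move are illustrative, not from the paper.

```python
import math
import random

SR_EPSILON = 0.5      # assumed: 0.5-greedy simple-regret policy at the root
UCB_C = math.sqrt(2)  # assumed UCB1 exploration constant

class Node:
    """A tree node; children are assumed to be expanded elsewhere."""
    def __init__(self, state):
        self.state = state
        self.children = []      # one child Node per legal move
        self.visits = 0
        self.total_reward = 0.0

    @property
    def mean(self):
        return self.total_reward / self.visits if self.visits else 0.0

def rollout(state):
    # Placeholder: a real MCTS plays out a simulation from `state`
    # and returns its reward; here we just return a dummy reward.
    return random.random()

def ucb_child(node):
    """Cumulative-regret (UCB1) selection, used below the root as in UCT."""
    def ucb(c):
        if c.visits == 0:
            return float("inf")
        return c.mean + UCB_C * math.sqrt(math.log(node.visits) / c.visits)
    return max(node.children, key=ucb)

def simple_regret_child(node):
    """Simple-regret-oriented selection at the root: with probability
    SR_EPSILON sample the empirically best arm, otherwise explore."""
    if random.random() < SR_EPSILON:
        return max(node.children, key=lambda c: c.mean)
    return random.choice(node.children)

def sample(node, depth=0):
    """One MCTS sample: SR policy at the root (depth 0), CR below it."""
    if not node.children:
        reward = rollout(node.state)
    else:
        child = simple_regret_child(node) if depth == 0 else ucb_child(node)
        reward = sample(child, depth + 1)
    node.visits += 1
    node.total_reward += reward
    return reward

def best_move(root, num_samples=1000):
    """After sampling, commit to the child with the best empirical mean:
    the final "arm pull" whose simple regret the scheme targets."""
    for _ in range(num_samples):
        sample(root)
    return max(root.children, key=lambda c: c.mean)
```

The split reflects the observation in the abstract: only the move finally chosen at the root collects a reward, so the root policy should minimize simple regret, while nodes below the root still benefit from the cumulative-regret behavior of UCB, which concentrates samples on promising subtrees and yields accurate value estimates for the root's arms.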

Published

2021-09-20

How to Cite

Tolpin, D., & Shimony, S. (2021). MCTS Based on Simple Regret. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 570-576. https://doi.org/10.1609/aaai.v26i1.8126

Section

Constraints, Satisfiability, and Search