MCTS Based on Simple Regret

Authors

  • David Tolpin, Ben-Gurion University of the Negev
  • Solomon Shimony, Ben-Gurion University of the Negev

DOI:

https://doi.org/10.1609/aaai.v26i1.8126

Keywords:

search, MCTS, UCT, metareasoning, VOI

Abstract

UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
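
The two-stage SR+CR scheme described in the abstract can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes an ε-greedy simple-regret policy at the root (with ε = 0.5) and plain UCB1 below the root, operating on a pre-expanded tree; the names Node, rollout, SR_EPSILON, UCB_C, and best_move are illustrative, not from the paper.

```python
import math
import random

SR_EPSILON = 0.5      # assumed: 0.5-greedy simple-regret policy at the root
UCB_C = math.sqrt(2)  # assumed UCB1 exploration constant

class Node:
    """A tree node; children are assumed to be expanded elsewhere."""
    def __init__(self, state):
        self.state = state
        self.children = []      # one child Node per legal move
        self.visits = 0
        self.total_reward = 0.0

    @property
    def mean(self):
        return self.total_reward / self.visits if self.visits else 0.0

def rollout(state):
    # Placeholder: a real MCTS plays out a simulation from `state`
    # and returns its reward; here we just return a dummy reward.
    return random.random()

def ucb_child(node):
    """Cumulative-regret (UCB1) selection, used below the root as in UCT."""
    def ucb(c):
        if c.visits == 0:
            return float("inf")
        return c.mean + UCB_C * math.sqrt(math.log(node.visits) / c.visits)
    return max(node.children, key=ucb)

def simple_regret_child(node):
    """Simple-regret-oriented selection at the root: with probability
    SR_EPSILON sample the empirically best arm, otherwise explore."""
    if random.random() < SR_EPSILON:
        return max(node.children, key=lambda c: c.mean)
    return random.choice(node.children)

def sample(node, depth=0):
    """One MCTS sample: SR policy at the root (depth 0), CR below it."""
    if not node.children:
        reward = rollout(node.state)
    else:
        child = simple_regret_child(node) if depth == 0 else ucb_child(node)
        reward = sample(child, depth + 1)
    node.visits += 1
    node.total_reward += reward
    return reward

def best_move(root, num_samples=1000):
    """After sampling, commit to the child with the best empirical mean:
    the final "arm pull" whose simple regret the scheme targets."""
    for _ in range(num_samples):
        sample(root)
    return max(root.children, key=lambda c: c.mean)
```

The split reflects the observation in the abstract: only the move finally chosen at the root collects a reward, so the root policy should minimize simple regret, while nodes below the root still benefit from the cumulative-regret behavior of UCB, which concentrates samples on promising subtrees and yields accurate value estimates for the root's arms.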

Published

2021-09-20

How to Cite

Tolpin, D., & Shimony, S. (2021). MCTS Based on Simple Regret. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 570-576. https://doi.org/10.1609/aaai.v26i1.8126

Section

Constraints, Satisfiability, and Search