MCTS Based on Simple Regret

Authors

  • David Tolpin, Ben-Gurion University of the Negev
  • Solomon Shimony, Ben-Gurion University of the Negev

DOI:

https://doi.org/10.1609/socs.v3i1.18221

Keywords:

mcts, uct, voi, metareasoning

Abstract

UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS), is based on UCB, a policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple, rather than cumulative, regret. We introduce policies for multi-armed bandits with lower simple regret than UCB and develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. We also propose a sampling scheme based on value of information (VOI), achieving an algorithm that empirically outperforms other proposed algorithms.
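
To make the distinction between the two regret measures concrete, the following is a minimal Python sketch. It is an illustration only, not the authors' SR+CR scheme or VOI-based policy. It assumes Bernoulli arms with known success probabilities `probs` (used only to score regret), and the names `run_bandit`, `ucb1`, and `round_robin` are hypothetical. Cumulative regret is taken here as the sum, over all pulls, of the gap between the best arm's mean and the pulled arm's mean; simple regret is the gap for the single arm recommended after sampling ends.

```python
import math
import random


def ucb1(counts, means, t):
    """UCB1: pull each arm once, then maximize mean + sqrt(2 ln t / n_i).

    t is the 1-based index of the current pull.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )


def round_robin(counts, means, t):
    """Uniform baseline (an assumption for comparison): cycle through the arms."""
    return (t - 1) % len(counts)


def run_bandit(policy, probs, n_pulls, seed=0):
    """Run a sampling policy and return (cumulative_regret, simple_regret).

    probs holds the true Bernoulli means; the policy never sees them.
    """
    rng = random.Random(seed)
    k = len(probs)
    counts, means = [0] * k, [0.0] * k
    best = max(probs)
    cumulative = 0.0
    for t in range(1, n_pulls + 1):
        i = policy(counts, means, t)
        reward = 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental sample mean
        cumulative += best - probs[i]  # every pull contributes to cumulative regret
    # Only the final recommendation matters for simple regret.
    recommended = max(range(k), key=lambda i: means[i])
    simple = best - probs[recommended]
    return cumulative, simple


if __name__ == "__main__":
    arms = [0.2, 0.5, 0.8]
    for name, policy in (("ucb1", ucb1), ("round_robin", round_robin)):
        cum, simple = run_bandit(policy, arms, n_pulls=200)
        print(name, "cumulative:", round(cum, 3), "simple:", round(simple, 3))
```

A policy can accumulate a large cumulative regret while still ending with a low simple regret, because exploratory pulls cost nothing once only the final choice is rewarded. That is the setting the abstract describes for MCTS.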

Published

2021-08-20

Section

Extended Abstracts of Papers Presented Elsewhere