Extreme Value Monte Carlo Tree Search for Classical Planning

Masataro Asai; Stephen Wissow

doi:10.1609/aaai.v40i43.40934

Authors

Masataro Asai IBM Research / MIT-IBM Watson AI Lab
Stephen Wissow University of New Hampshire

DOI:

https://doi.org/10.1609/aaai.v40i43.40934

Abstract

Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi Armed Bandit (MAB) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai, 2024) showed that UCB1, designed for bounded rewards, does not perform well as applied to cost-to-go estimates in classical planning, because cost-to-go estimates are unbounded, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as (-∞, ∞), which we can narrow down. Second, Full Bellman backup (Schulte and Keller, 2014) that backpropagates sample max/min lacks theoretical justification. We use Peaks-Over-Threashold Extreme Value Theory to resolve both issues at once, propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.

Extreme Value Monte Carlo Tree Search for Classical Planning

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information