Confidence Backup Updates for Aggregating MDP State Values in Monte-Carlo Tree Search

Authors

  • Zahy Bnaya New York University
  • Alon Palombo Ben-Gurion University
  • Rami Puzis Ben-Gurion University
  • Ariel Felner Ben-Gurion University

DOI:

https://doi.org/10.1609/socs.v6i1.18378

Abstract

Monte-Carlo Tree Search (MCTS) algorithms estimate the value of MDP states based on rewards received by performing multiple random simulations. MCTS algorithms can use different strategies to aggregate these rewards into an estimate of the states' values. The most common aggregation method stores the mean reward over all simulations. Another common approach stores the best reward observed from each state. These two methods have complementary benefits and drawbacks. In this paper, we show that both are biased estimators of the true expected value of MDP states. We propose a hybrid approach that uses the best reward for states with low noise, and otherwise uses the mean. Experimental results on the Sailing MDP domain show that our method has a considerable advantage when the rewards are drawn from a noisy distribution.
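To make the hybrid idea concrete, the following is a minimal sketch of such a backup rule, not the authors' exact method (the abstract does not specify how "low noise" is determined). The class name `HybridNode` and the `noise_threshold` variance cutoff are assumptions introduced for illustration; rewards are tracked online with Welford's algorithm.

```python
import math

class HybridNode:
    """MCTS node with a hybrid backup: report the best observed reward
    when the reward sample looks low-noise, otherwise report the mean.
    This is an illustrative sketch, not the paper's exact rule."""

    def __init__(self, noise_threshold=0.1):
        # Hypothetical variance cutoff below which the reward is "low noise".
        self.noise_threshold = noise_threshold
        self.n = 0             # number of simulations backed up here
        self.mean = 0.0        # running mean of rewards
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.best = -math.inf  # best reward observed so far

    def update(self, reward):
        """Incorporate one simulation reward (Welford's online update)."""
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)
        self.best = max(self.best, reward)

    def variance(self):
        """Unbiased sample variance of the rewards seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("inf")

    def value(self):
        """Hybrid estimate: best reward if noise is low, else the mean."""
        if self.n == 0:
            return 0.0
        if self.variance() <= self.noise_threshold:
            return self.best
        return self.mean
```

A brief usage example under the same assumptions: after three tightly clustered rewards the variance falls under the cutoff, so `value()` returns the best observed reward rather than the mean.

```python
node = HybridNode(noise_threshold=0.05)
for r in [1.0, 1.02, 0.98]:
    node.update(r)
print(node.value())  # low variance -> returns 1.02, the best reward
```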

Published

2021-09-01