LRTDP Versus UCT for Online Probabilistic Planning

Authors

  • Andrey Kolobov, University of Washington, Seattle
  • Mausam, University of Washington, Seattle
  • Daniel Weld, University of Washington, Seattle

DOI:

https://doi.org/10.1609/aaai.v26i1.8362

Keywords:

Markov Decision Process, UCT, LRTDP, Online planning

Abstract

UCT, the premier method for solving games such as Go, is also becoming the dominant algorithm for probabilistic planning. Out of the five solvers at the International Probabilistic Planning Competition (IPPC) 2011, four were based on the UCT algorithm. However, while a UCT-based planner, PROST, won the contest, an LRTDP-based system, Glutton, came in a close second, outperforming other systems derived from UCT. These results raise a question: what are the strengths and weaknesses of LRTDP and UCT in practice? This paper starts answering this question by contrasting the two approaches in the context of finite-horizon MDPs. We demonstrate that in such scenarios, UCT's lack of a sound termination condition is a serious practical disadvantage. In order to handle an MDP with a large finite horizon under a time constraint, UCT forces an expert to guess a non-myopic lookahead value for which it should be able to converge on the encountered states. Mistakes in setting this parameter can greatly hurt UCT's performance. In contrast, LRTDP's convergence criterion allows for an iterative deepening strategy. Using this strategy, LRTDP automatically finds the largest lookahead value feasible under the given time constraint. As a result, LRTDP has better performance and stronger theoretical properties. We present an online version of Glutton, named Gourmand, that illustrates this analysis and outperforms PROST on the set of IPPC-2011 problems.
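The iterative deepening strategy described in the abstract can be sketched as a control loop: solve the MDP to convergence for lookahead 1, then 2, and so on, keeping the deepest value function that finishes within the time budget. The sketch below is illustrative only, not Gourmand's implementation: the toy chain MDP, the function names, and the use of exact finite-horizon backups as a stand-in for an LRTDP run to convergence are all assumptions made for brevity.

```python
import time

# Toy finite-horizon MDP: a noisy 5-state chain (illustrative only;
# not from the paper). "advance" moves right with probability 0.8.
STATES = range(5)
ACTIONS = ("stay", "advance")

def transitions(s, a):
    """Return [(prob, next_state)] pairs."""
    if a == "advance" and s < 4:
        return [(0.8, s + 1), (0.2, s)]
    return [(1.0, s)]

def reward(s, a):
    return 1.0 if s == 4 else 0.0

def solve_lookahead(h):
    """Stand-in for an LRTDP run to convergence with lookahead h:
    h exact Bellman backups, returning the depth-h value function."""
    V = {s: 0.0 for s in STATES}  # value with 0 steps to go
    for _ in range(h):
        V = {s: max(reward(s, a) +
                    sum(p * V[ns] for p, ns in transitions(s, a))
                    for a in ACTIONS)
             for s in STATES}
    return V

def iterative_deepening(budget_s):
    """Grow the lookahead until the budget runs out, keeping the
    deepest value function that converged within the budget."""
    deadline = time.monotonic() + budget_s
    best_h, best_V = 0, {s: 0.0 for s in STATES}
    h = 1
    while time.monotonic() < deadline:
        V = solve_lookahead(h)
        if time.monotonic() > deadline:
            break  # this depth did not finish in time; discard it
        best_h, best_V = h, V
        h += 1
    return best_h, best_V
```

Because each depth is only accepted after its solve completes, the planner always acts on a fully converged (if possibly shallow) value function, which is the soundness property the abstract contrasts with UCT's guessed lookahead.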

Published

2021-09-20

How to Cite

Kolobov, A., Mausam, & Weld, D. (2021). LRTDP Versus UCT for Online Probabilistic Planning. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 1786-1792. https://doi.org/10.1609/aaai.v26i1.8362

Section

Reasoning about Plans, Processes and Actions