Classical Planning in MDP Heuristics: with a Little Help from Generalization

Authors

  • Andrey Kolobov University of Washington, Seattle
  • Mausam University of Washington, Seattle
  • Daniel Weld University of Washington, Seattle

DOI:

https://doi.org/10.1609/icaps.v20i1.13424

Keywords:

Heuristics, MDPs, Probabilistic planning

Abstract

Computing a good policy in stochastic, uncertain environments with unknown dynamics and reward model parameters is a challenging task. In a number of domains, ranging from space robotics to epilepsy management, it may be possible to have an initial training period during which suboptimal performance is permitted. For such problems it is important to be able to identify when this training period is complete and the computed policy can be used with high confidence in its future performance. A simple, principled criterion for deciding that training is complete is that the error bounds on the value estimates of the current policy are sufficiently small that the optimal policy is fixed, with high probability. We present an upper bound on the amount of training data required to identify the optimal policy as a function of the unknown separation gap between the values of the optimal and the next-best policy. We illustrate on several small problems that, by estimating this gap online, the number of training samples needed to provably reach optimality can be significantly lower than the number predicted offline by a Probably Approximately Correct framework, which requires an input epsilon parameter.
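
The stopping criterion described in the abstract can be illustrated with a minimal Python sketch: training is declared complete once the confidence interval around the best policy's value estimate no longer overlaps any competitor's interval, so the optimal policy is fixed with high probability. The function name, the dictionary interface, and the toy numbers below are assumptions for illustration, not the paper's implementation.

    def training_complete(value_estimates, half_widths):
        """Illustrative stopping check (assumed interface, not the paper's code).

        value_estimates: dict mapping policy id -> estimated value
        half_widths:     dict mapping policy id -> confidence-interval half-width
        Returns (done, best): done is True when the best policy's lower bound
        exceeds every other policy's upper bound, i.e. the intervals separate.
        """
        best = max(value_estimates, key=value_estimates.get)
        lower_best = value_estimates[best] - half_widths[best]
        for policy, value in value_estimates.items():
            if policy == best:
                continue
            if value + half_widths[policy] >= lower_best:
                return False, best  # intervals still overlap; keep training
        return True, best           # gap resolved; optimal policy fixed w.h.p.

    # Toy usage with made-up numbers: policy "A" is clearly separated.
    estimates = {"A": 10.0, "B": 7.5, "C": 6.0}
    widths    = {"A": 0.8,  "B": 0.9, "C": 1.1}
    done, best = training_complete(estimates, widths)
    print(done, best)  # True A

In such a sketch the half-widths shrink as more samples are collected, so the check succeeds as soon as they fall below the (estimated) separation gap, mirroring the abstract's point that the sample requirement depends on the gap rather than on a fixed input epsilon.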

Published

2021-05-25

How to Cite

Kolobov, A., Mausam, & Weld, D. (2021). Classical Planning in MDP Heuristics: with a Little Help from Generalization. Proceedings of the International Conference on Automated Planning and Scheduling, 20(1), 97-104. https://doi.org/10.1609/icaps.v20i1.13424