Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Authors

  • Avik Kar, Indian Institute of Science
  • Rahul Singh, Indian Institute of Science

DOI

https://doi.org/10.1609/aaai.v40i27.39412

Abstract

We study infinite-horizon average-reward reinforcement learning for continuous-space Lipschitz Markov decision processes (MDPs) in which an agent can play policies from a given set $\Phi$. The proposed algorithms efficiently explore the policy space by “zooming” into the “promising regions” of $\Phi$, thereby achieving adaptivity gains in performance. We upper bound the regret as $\tilde{O}\big(T^{1 - 1/d_{\mathrm{eff.}}}\big)$, where $d_{\mathrm{eff.}} = d^{\Phi}_{z} + 2$ for our model-free algorithm PZRL-MF and $d_{\mathrm{eff.}} = 2 d_{S} + d^{\Phi}_{z} + 3$ for our model-based algorithm PZRL-MB. Here, $d_{S}$ is the dimension of the state space and $d^{\Phi}_{z}$ is the zooming dimension for the given set of policies $\Phi$. The zooming dimension $d^{\Phi}_{z}$ is an alternative measure of the complexity of the problem; it depends on the underlying MDP as well as on $\Phi$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $\Phi$ (one with a small $d^{\Phi}_{z}$). When specialized to a finite-dimensional policy space, $d_{\mathrm{eff.}}$ scales as the dimension of this space under mild technical conditions; moreover, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature, PZRL-MF attains $d_{\mathrm{eff.}} = 2$, or equivalently $\tilde{O}(\sqrt{T})$ regret.
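
The exponent calculus is easy to miss in prose, so here is a minimal LaTeX restatement of the bounds above. The symbol $\operatorname{Reg}(T)$ is our own shorthand for the cumulative regret; all other quantities are exactly those defined in the abstract.

% Minimal sketch restating the abstract's regret bounds.
% Reg(T) is an assumed shorthand for cumulative regret over horizon T.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Both algorithms satisfy
\[
  \operatorname{Reg}(T) = \tilde{O}\!\left(T^{\,1 - 1/d_{\mathrm{eff.}}}\right),
  \qquad
  d_{\mathrm{eff.}} =
  \begin{cases}
    d^{\Phi}_{z} + 2, & \text{model-free (PZRL-MF)},\\
    2\,d_{S} + d^{\Phi}_{z} + 3, & \text{model-based (PZRL-MB)}.
  \end{cases}
\]
In particular, the curvature-condition case for PZRL-MF gives
$d_{\mathrm{eff.}} = 2$, recovering $\tilde{O}(\sqrt{T})$ regret since
$T^{1 - 1/2} = \sqrt{T}$.
\end{document}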

Published

2026-03-14

How to Cite

Kar, A., & Singh, R. (2026). Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22527-22535. https://doi.org/10.1609/aaai.v40i27.39412

Issue

Vol. 40 No. 27 (2026)

Section

AAAI Technical Track on Machine Learning IV