Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Authors

  • Avik Kar, Indian Institute of Science
  • Rahul Singh, Indian Institute of Science

DOI

https://doi.org/10.1609/aaai.v40i27.39412

Abstract

We study infinite-horizon average-reward reinforcement learning for continuous-space Lipschitz Markov decision processes (MDPs) in which an agent can play policies from a given set $\Phi$. The proposed algorithms efficiently explore the policy space by “zooming” into the “promising regions” of $\Phi$, thereby achieving adaptivity gains in performance. We upper bound the regret as $\tilde{O}\big(T^{1 - 1/d_{\mathrm{eff.}}}\big)$, where $d_{\mathrm{eff.}} = d^{\Phi}_{z} + 2$ for our model-free algorithm PZRL-MF and $d_{\mathrm{eff.}} = 2 d_{S} + d^{\Phi}_{z} + 3$ for our model-based algorithm PZRL-MB. Here, $d_{S}$ is the dimension of the state space and $d^{\Phi}_{z}$ is the zooming dimension for the given set of policies $\Phi$. The zooming dimension $d^{\Phi}_{z}$ is an alternative measure of the complexity of the problem; it depends on the underlying MDP as well as on $\Phi$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $\Phi$ (one with a small $d^{\Phi}_{z}$). When specialized to a finite-dimensional policy space, $d_{\mathrm{eff.}}$ scales as the dimension of this space under mild technical conditions; moreover, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature, PZRL-MF attains $d_{\mathrm{eff.}} = 2$, or equivalently $\tilde{O}(\sqrt{T})$ regret.
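
The exponent calculus is easy to miss in prose, so here is a minimal LaTeX restatement of the bounds above. The symbol $\operatorname{Reg}(T)$ is our own shorthand for the cumulative regret; all other quantities are exactly those defined in the abstract.

% Minimal sketch restating the abstract's regret bounds.
% Reg(T) is an assumed shorthand for cumulative regret over horizon T.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Both algorithms satisfy
\[
  \operatorname{Reg}(T) = \tilde{O}\!\left(T^{\,1 - 1/d_{\mathrm{eff.}}}\right),
  \qquad
  d_{\mathrm{eff.}} =
  \begin{cases}
    d^{\Phi}_{z} + 2, & \text{model-free (PZRL-MF)},\\
    2\,d_{S} + d^{\Phi}_{z} + 3, & \text{model-based (PZRL-MB)}.
  \end{cases}
\]
In particular, the curvature-condition case for PZRL-MF gives
$d_{\mathrm{eff.}} = 2$, recovering $\tilde{O}(\sqrt{T})$ regret since
$T^{1 - 1/2} = \sqrt{T}$.
\end{document}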

Published

2026-03-14

How to Cite

Kar, A., & Singh, R. (2026). Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22527-22535. https://doi.org/10.1609/aaai.v40i27.39412

Issue

Vol. 40 No. 27 (2026)

Section

AAAI Technical Track on Machine Learning IV