Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits

Authors

  • Guojun Xiong SUNY-Binghamton University
  • Jian Li SUNY-Binghamton University
  • Rahul Singh Indian Institute of Science

DOI:

https://doi.org/10.1609/aaai.v36i8.20852

Keywords:

Machine Learning (ML)

Abstract

We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state and the action of the corresponding MDP. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy, the Occupancy-Measured-Reward Index Policy, for the finite-horizon R(MA)^2B. Our index policy is well-defined without requiring the indexability condition and is provably asymptotically optimal as the number of arms tends to infinity. We then adopt a learning perspective where the system parameters are unknown, and propose R(MA)^2B-UCB, a generative-model-based reinforcement learning augmented algorithm that fully exploits the structure of the Occupancy-Measured-Reward Index Policy. Compared to existing algorithms, R(MA)^2B-UCB performs close to the offline optimum, and achieves sub-linear regret and low computational complexity all at once. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
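The learning step described above estimates each arm's unknown MDP from generative-model samples and injects optimism through confidence bounds before the index policy is applied. A minimal sketch of such an estimate-plus-confidence step is below; the function name and the Hoeffding-style L1 radius are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def estimate_model(samples, n_states, n_actions, delta=0.05):
    """Estimate transition probabilities from generative-model samples
    (s, a, s') and attach a confidence radius per (s, a) pair.
    Hypothetical helper: a sketch of the generic UCB-style estimation
    step, not the algorithm from the paper."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)  # visits to each (s, a)
    # Empirical transition kernel; unvisited (s, a) pairs fall back to uniform.
    p_hat = np.divide(counts, n_sa,
                      out=np.full_like(counts, 1.0 / n_states),
                      where=n_sa > 0)
    # Hoeffding-style L1 confidence radius, shrinking as O(1/sqrt(n)).
    radius = np.sqrt(2 * n_states * np.log(2 / delta)
                     / np.maximum(n_sa[..., 0], 1))
    return p_hat, radius
```

An optimistic planner would then search over kernels within `radius` of `p_hat` (or, as in occupancy-measure formulations, fold the bonus into a linear program over occupancy measures) and run the resulting index policy.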

Published

2022-06-28

How to Cite

Xiong, G., Li, J., & Singh, R. (2022). Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8726-8734. https://doi.org/10.1609/aaai.v36i8.20852

Section

AAAI Technical Track on Machine Learning III