Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets

Authors

  • Zongqi Wan — Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Zhijie Zhang — Fuzhou University
  • Tongyang Li — Peking University
  • Jialin Zhang — Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Xiaoming Sun — Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v37i8.26202

Keywords:

ML: Quantum Machine Learning, ML: Online Learning & Bandits

Abstract

Multi-armed bandit (MAB) and stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well known that classical algorithms for bandits with time horizon T suffer regret of at least Ω(√T). In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with O(poly(log T)) regret, exponentially improving the dependence on T. To the best of our knowledge, this is the first provable quantum speedup for the regret of bandit problems and, more generally, for exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and only assumes quantum oracles for each individual arm.
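To make the classical baseline concrete, the sketch below simulates the UCB1 algorithm on Bernoulli arms and reports its cumulative pseudo-regret. This is not the paper's quantum algorithm; it is a minimal illustration of the classical MAB setting whose Ω(√T) worst-case regret (and O(log T) gap-dependent regret) the paper's quantum oracles improve upon. The arm means and horizon are arbitrary choices for the demo.

```python
# Classical UCB1 baseline for the multi-armed bandit setting in the abstract.
# NOT the paper's quantum algorithm: a hedged sketch of the classical model
# whose regret lower bounds the proposed quantum algorithms circumvent.
import math
import random


def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return cumulative pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialize
        else:
            # UCB index: empirical mean plus a confidence radius that
            # shrinks as an arm is pulled more often
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # pseudo-regret of this pull
    return regret
```

With a classical (sample-based) reward oracle, the cumulative regret here grows like log T for a fixed gap between arms, but no classical algorithm can beat √T uniformly over instances; the paper's contribution is that quantum reward oracles allow poly(log T) regret without such gap dependence in the worst case.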

Published

2023-06-26

How to Cite

Wan, Z., Zhang, Z., Li, T., Zhang, J., & Sun, X. (2023). Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8), 10087-10094. https://doi.org/10.1609/aaai.v37i8.26202

Section

AAAI Technical Track on Machine Learning III