Forced Exploration in Bandit Problems

Authors

  • Qi Han, Xi'an Jiaotong University
  • Li Zhu, Xi'an Jiaotong University
  • Fei Guo, Xi'an Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v38i11.29117

Keywords:

ML: Online Learning & Bandits, ML: Learning Theory

Abstract

The multi-armed bandit (MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., bounded), while practitioners may have difficulty obtaining information about these distributions to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm alternating between the greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli, and other sub-Gaussian distributions, and its implementation does not require additional information. We employ a unified analysis method for different forced exploration strategies and provide problem-dependent regret upper bounds for stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
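The abstract describes an algorithm that alternates between the greedy rule and forced exploration. The Python sketch below illustrates one plausible shape of such an alternation: periodic forced sweeps over all arms, interleaved with greedy phases of growing length. The `schedule` function and the sweep structure here are illustrative assumptions for exposition, not the paper's analyzed strategies; see the paper for the actual forced-exploration schedules and their regret guarantees.

```python
import numpy as np

def forced_exploration_bandit(arms, horizon, schedule=lambda phase: phase):
    """Sketch of a bandit loop alternating greedy play with forced exploration.

    arms: list of zero-argument callables, each returning a reward sample.
    horizon: total number of pulls.
    schedule(phase): number of greedy rounds between forced sweeps
        (an illustrative placeholder, not the paper's exact schedule).
    """
    K = len(arms)
    counts = np.zeros(K)   # number of pulls per arm
    means = np.zeros(K)    # empirical mean reward per arm
    t, phase = 0, 1
    while t < horizon:
        # Forced exploration: pull every arm once, regardless of its mean.
        for a in range(K):
            if t >= horizon:
                break
            r = arms[a]()
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]  # incremental mean update
            t += 1
        # Greedy phase: exploit the current empirical best arm.
        for _ in range(schedule(phase)):
            if t >= horizon:
                break
            a = int(np.argmax(means))
            r = arms[a]()
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            t += 1
        phase += 1  # greedy phases grow, so forced pulls become rarer

    return means, counts

# Usage: three Gaussian arms; no boundedness assumption on rewards is used.
rng = np.random.default_rng(0)
arms = [lambda m=m: rng.normal(m, 1.0) for m in (0.2, 0.5, 0.9)]
means, counts = forced_exploration_bandit(arms, horizon=10_000)
```

Note the design point the abstract emphasizes: nothing in this loop uses knowledge of the reward distribution (no confidence widths, no boundedness constants); exploration is driven purely by the schedule, which is what makes the approach distribution-free to implement.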

Published

2024-03-24

How to Cite

Han, Q., Zhu, L., & Guo, F. (2024). Forced Exploration in Bandit Problems. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12270–12277. https://doi.org/10.1609/aaai.v38i11.29117

Issue

Vol. 38 No. 11 (2024)

Section

AAAI Technical Track on Machine Learning II