Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract)

Authors

  • Egor Suraveikin, Lomonosov Moscow State University
  • Dastan Omirzak, Moscow Independent Research Institute of Artificial Intelligence
  • Roman Sultimov, Lomonosov Moscow State University; Moscow Independent Research Institute of Artificial Intelligence
  • Yury Maximov, Interdata LLC

DOI:

https://doi.org/10.1609/aaai.v40i48.42284

Abstract

The contextual multi-armed bandit problem underlies applications in recommendations, e-commerce, finance, and healthcare, where balancing exploration and exploitation is critical. While algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) achieve strong theoretical guarantees, they often incur heavy computational cost from high-dimensional parameter estimation. We propose a new approach that combines reward sampling with online stochastic optimization. At each round, the algorithm samples hypothetical rewards for all actions and selects the action with the largest draw; the observed reward then updates the model via stochastic optimization. This design is both simple and efficient, preserving exploration while avoiding the pitfalls of greedy behavior on near-duplicate arms. Across synthetic and real-world datasets, our method attains near-optimal reward more quickly and with substantially lower computation than TS and UCB, demonstrating that sampling directly in reward space can improve both statistical efficiency and scalability.
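
The round structure described in the abstract (sample a hypothetical reward for every action, play the action with the largest draw, then update the model from the observed reward) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the reward model, the sampling distribution, or the optimizer. The sketch assumes a separate linear reward model per arm, Gaussian draws with a fixed standard deviation around the predicted reward, and a plain stochastic gradient step on squared error. The names `RewardSamplingBandit`, `noise_scale`, `lr`, and `run_demo` are hypothetical.

```python
import random


class RewardSamplingBandit:
    """Illustrative reward-space sampling bandit (assumptions noted inline)."""

    def __init__(self, n_arms, dim, noise_scale=0.5, lr=0.05, seed=0):
        # Assumption: one linear reward model per arm, initialised at zero.
        self.weights = [[0.0] * dim for _ in range(n_arms)]
        # Assumption: fixed std of the Gaussian used to sample rewards.
        self.noise_scale = noise_scale
        # Step size of the stochastic gradient update.
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, context, arm):
        """Predicted mean reward of `arm` for this context."""
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def select(self, context):
        """Sample one hypothetical reward per arm and play the largest draw."""
        draws = [
            self.predict(context, arm) + self.rng.gauss(0.0, self.noise_scale)
            for arm in range(len(self.weights))
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, context, arm, reward):
        """One SGD step on 0.5 * (prediction - reward)^2 for the played arm."""
        error = self.predict(context, arm) - reward
        self.weights[arm] = [
            w - self.lr * error * x for w, x in zip(self.weights[arm], context)
        ]


def run_demo(n_rounds=500, n_arms=3, dim=4, seed=0):
    """Run the agent on a synthetic linear environment (made up for this demo)."""
    rng = random.Random(seed)
    true_weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_arms)]
    agent = RewardSamplingBandit(n_arms, dim, seed=seed + 1)
    total = 0.0
    for _ in range(n_rounds):
        context = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        arm = agent.select(context)
        mean = sum(w * x for w, x in zip(true_weights[arm], context))
        reward = mean + rng.gauss(0.0, 0.1)  # noisy observed reward
        agent.update(context, arm, reward)
        total += reward
    return agent, total / n_rounds
```

In this sketch each round costs one dot product per arm plus a single gradient step, with no matrix inversion or posterior sampling over parameters, which is one way to read the computational argument made in the abstract. Exploration comes entirely from the noise added in reward space; setting `noise_scale` to zero reduces `select` to a greedy policy.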

Published

2026-03-14

How to Cite

Suraveikin, E., Omirzak, D., Sultimov, R., & Maximov, Y. (2026). Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41398–41399. https://doi.org/10.1609/aaai.v40i48.42284