Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract)
DOI:
https://doi.org/10.1609/aaai.v40i48.42284
Abstract
The contextual multi-armed bandit problem underlies applications in recommendations, e-commerce, finance, and healthcare, where balancing exploration and exploitation is critical. While algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) achieve strong theoretical guarantees, they often incur heavy computational cost from high-dimensional parameter estimation. We propose a new approach that combines reward sampling with online stochastic optimization. At each round, the algorithm samples hypothetical rewards for all actions and selects the action with the largest draw; the observed reward then updates the model via stochastic optimization. This design is both simple and efficient, preserving exploration while avoiding the pitfalls of greedy behavior on near-duplicate arms. Across synthetic and real-world datasets, our method attains near-optimal reward more quickly and with substantially lower computation than TS and UCB, demonstrating that sampling directly in reward space can improve both statistical efficiency and scalability.
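The round structure described in the abstract (sample a hypothetical reward for every action, play the largest draw, update the model with a stochastic-optimization step) can be illustrated with a minimal sketch. The abstract does not specify the reward model, the sampling distribution, or the optimizer, so everything concrete here is an assumption made for illustration: a separate linear model per arm, Gaussian draws with a fixed `noise_scale` around the predicted reward, a single SGD step on squared error, and the hypothetical names `RewardSamplingBandit` and `run_demo`. It is not the authors' implementation.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


class RewardSamplingBandit:
    """Illustrative sketch only: per-arm linear reward model (assumed),
    Gaussian draws in reward space (assumed), SGD updates (assumed)."""

    def __init__(self, n_arms, dim, noise_scale=0.1, lr=0.05, seed=0):
        self.weights = [[0.0] * dim for _ in range(n_arms)]
        self.noise_scale = noise_scale  # assumed fixed exploration scale
        self.lr = lr                    # assumed constant SGD step size
        self.rng = random.Random(seed)

    def select(self, context):
        # Draw one hypothetical reward per arm: predicted mean plus noise.
        draws = [
            dot(w, context) + self.rng.gauss(0.0, self.noise_scale)
            for w in self.weights
        ]
        # Play the arm with the largest sampled reward.
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, context, reward):
        # One SGD step on 0.5 * (prediction - reward)^2 for the played arm.
        w = self.weights[arm]
        error = dot(w, context) - reward
        for i, x in enumerate(context):
            w[i] -= self.lr * error * x


def run_demo(rounds=2000, seed=1):
    """Run the sketch on a small synthetic problem and return the ratio of
    collected mean reward to the per-round best mean reward."""
    rng = random.Random(seed)
    true_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # synthetic, made up
    agent = RewardSamplingBandit(n_arms=3, dim=2, seed=seed)
    total, best_total = 0.0, 0.0
    for _ in range(rounds):
        context = [rng.random(), rng.random()]
        arm = agent.select(context)
        means = [dot(w, context) for w in true_weights]
        reward = means[arm] + rng.gauss(0.0, 0.1)  # noisy observed reward
        agent.update(arm, context, reward)
        total += means[arm]
        best_total += max(means)
    return total / best_total


if __name__ == "__main__":
    print(f"fraction of oracle mean reward: {run_demo():.3f}")
```

Note that each round costs only one prediction per arm and one gradient step, with no matrix inversion or posterior update, which is the kind of saving the abstract attributes to sampling in reward space. The fixed noise scale is a simplification; how the exploration noise should be set or adapted is not stated in the abstract.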
Published
2026-03-14
How to Cite
Suraveikin, E., Omirzak, D., Sultimov, R., & Maximov, Y. (2026). Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41398–41399. https://doi.org/10.1609/aaai.v40i48.42284
Section
AAAI Student Abstract and Poster Program