Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract)

Egor Suraveikin; Dastan Omirzak; Roman Sultimov; Yury Maximov

doi:10.1609/aaai.v40i48.42284

Authors

Egor Suraveikin, Lomonosov Moscow State University
Dastan Omirzak, Moscow Independent Research Institute of Artificial Intelligence
Roman Sultimov, Lomonosov Moscow State University; Moscow Independent Research Institute of Artificial Intelligence
Yury Maximov, Interdata LLC

Abstract

The contextual multi-armed bandit problem underlies applications in recommendations, e-commerce, finance, and healthcare, where balancing exploration and exploitation is critical. While algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) achieve strong theoretical guarantees, they often incur heavy computational cost from high-dimensional parameter estimation. We propose a new approach that combines reward sampling with online stochastic optimization. At each round, the algorithm samples hypothetical rewards for all actions and selects the action with the largest draw; the observed reward then updates the model via stochastic optimization. This design is both simple and efficient, preserving exploration while avoiding the pitfalls of greedy behavior on near-duplicate arms. Across synthetic and real-world datasets, our method attains near-optimal reward more quickly and with substantially lower computation than TS and UCB, demonstrating that sampling directly in reward space can improve both statistical efficiency and scalability.
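
The round structure described in the abstract (sample a hypothetical reward for every action, play the action with the largest draw, then update the model from the observed reward with a stochastic optimization step) can be illustrated with a minimal sketch. The abstract does not specify the reward model, the sampling distribution, or the optimizer, so everything concrete here is an assumption made for illustration: a per-arm linear model, Gaussian draws with a fixed `noise_scale`, a plain SGD step on squared error, and the hypothetical names `RewardSamplingBandit` and `run_demo`. It should not be read as the authors' implementation.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


class RewardSamplingBandit:
    """Illustrative sketch only: per-arm linear model with Gaussian reward draws."""

    def __init__(self, n_arms, dim, noise_scale=0.5, lr=0.05, seed=0):
        self.weights = [[0.0] * dim for _ in range(n_arms)]
        self.noise_scale = noise_scale  # assumed fixed exploration noise
        self.lr = lr                    # assumed constant SGD step size
        self.rng = random.Random(seed)

    def select(self, context):
        # Draw one hypothetical reward per arm around its predicted mean,
        # then play the arm whose draw is largest.
        draws = [
            dot(w, context) + self.rng.gauss(0.0, self.noise_scale)
            for w in self.weights
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, context, reward):
        # Single SGD step on squared error, applied to the played arm only.
        w = self.weights[arm]
        err = dot(w, context) - reward
        for i, x in enumerate(context):
            w[i] -= self.lr * err * x


def run_demo(rounds=2000, seed=1):
    """Run the sketch on a small synthetic problem (made-up parameters)."""
    rng = random.Random(seed)
    true_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical ground truth
    agent = RewardSamplingBandit(n_arms=3, dim=2, seed=seed)
    total, best_total = 0.0, 0.0
    for _ in range(rounds):
        ctx = [rng.random(), rng.random()]
        arm = agent.select(ctx)
        reward = dot(true_w[arm], ctx) + rng.gauss(0.0, 0.1)
        agent.update(arm, ctx, reward)
        total += dot(true_w[arm], ctx)                    # expected reward earned
        best_total += max(dot(w, ctx) for w in true_w)    # oracle expected reward
    return total, best_total


if __name__ == "__main__":
    earned, oracle = run_demo()
    print(f"earned {earned:.1f} of oracle {oracle:.1f}")
```

Each round costs one inner product per arm plus one gradient step, with no matrix inversion or posterior sampling over parameters; that is the kind of saving the abstract attributes to sampling in reward space, though the actual method may differ in how the draws are generated.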

Published

2026-03-14

How to Cite

Suraveikin, E., Omirzak, D., Sultimov, R., & Maximov, Y. (2026). Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41398–41399. https://doi.org/10.1609/aaai.v40i48.42284

Issue

Vol. 40 No. 48: EAAI-26 AI for Education, Model AI Assignments, AAAI-26 Emerging Trends, Doctoral Consortium, Student Abstracts, Undergraduate Consortium and Demonstrations

Section

AAAI Student Abstract and Poster Program
