Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

Authors

  • Orin Levy, Tel Aviv University
  • Yishay Mansour, Tel Aviv University and Google Research

DOI:

https://doi.org/10.1609/aaai.v37i7.26025

Keywords:

ML: Reinforcement Learning Theory, ML: Reinforcement Learning Algorithms, ML: Online Learning & Bandits

Abstract

We present regret minimization algorithms for stochastic contextual MDPs (CMDPs) under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three settings: the dynamics are known; the dynamics are unknown but independent of the context; and, most challenging, the dynamics are unknown and context-dependent. For the last setting, our algorithm obtains a regret bound of order $(H + 1/p_{\min})\,H\,|S|^{3/2}\,(|A|\,T\log(\max\{|\mathcal{F}|,|\mathcal{P}|\}/\delta))^{1/2}$, up to poly-logarithmic factors, with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{F}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge about the function class, such as linearity). We present a lower bound of $\Omega((TH|S||A|\ln|\mathcal{F}|/\ln|A|)^{1/2})$ on the expected regret, which holds even when the dynamics are known. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains regret of order $T^{3/4}$.
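To get a feel for how the stated bounds scale, the following is a minimal sketch that evaluates only the leading-order expressions from the abstract. It drops all constants and poly-logarithmic factors, so the numbers are not actual regret values. The function names and the sample parameter values are hypothetical, chosen purely for illustration.

```python
import math


def upper_bound_order(H, S, A, T, p_min, class_size, delta):
    """Leading-order term of the stated upper bound.

    Constants and poly-logarithmic factors are omitted (an assumption of
    this sketch). class_size stands for the larger of the two function
    class sizes.
    """
    log_term = math.log(class_size / delta)
    return (H + 1.0 / p_min) * H * S ** 1.5 * math.sqrt(A * T * log_term)


def lower_bound_order(H, S, A, T, class_size):
    """Leading-order term of the stated lower bound (requires A >= 2)."""
    return math.sqrt(T * H * S * A * math.log(class_size) / math.log(A))


# Hypothetical problem sizes, for illustration only.
params = dict(H=10, S=20, A=5, p_min=0.1, class_size=1000, delta=0.05)

for T in (10_000, 40_000):
    ub = upper_bound_order(T=T, **params)
    print(f"T={T}: upper-bound order ~ {ub:.3e}")
```

Both expressions grow as the square root of T, so quadrupling the number of episodes doubles each of them. The two differ in their dependence on H, |S| and p_min.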

Published

2023-06-26

How to Cite

Levy, O., & Mansour, Y. (2023). Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7), 8510-8517. https://doi.org/10.1609/aaai.v37i7.26025

Section

AAAI Technical Track on Machine Learning II