Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions
DOI:
https://doi.org/10.1609/aaai.v39i21.34436Abstract
In this work, we study the logarithmic regret for reinforcement learning (RL) with linear function approximation and adversarial corruptions, in the formulation of linear Markov decision processes (MDPs). Specifically, we consider the case where there exist adversarial corruptions over the reward functions, and the total amount of the corruptions of each step h across all episodes K is bounded by a corruption level C ≥ 0. We propose an algorithm, double-weighted least-squares value iteration with UCB (DW-LSVI-UCB), which leverages weighted linear regressions to learn the (corrupted) unknown reward parameters and unknown transition parameters simultaneously. We prove that DW-LSVI-UCB attains an O( d2H4 log2(1+K/δ) gapmin + CdH2) regret (omitting the dependence on lower order terms), where d is the ambient dimension of the feature mapping, H is the horizon length, gapmin is the minimal sub-optimality gap, and K is the number of episodes. Additionally, when there are no adversarial corruptions over reward functions, the regret of our algorithm improves the previous best result by an O(dH/ log K) factor.Downloads
Published
2025-04-11
How to Cite
Zhao, C., Zhang, X., Wang, B., & Li, S. (2025). Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22759–22767. https://doi.org/10.1609/aaai.v39i21.34436
Issue
Section
AAAI Technical Track on Machine Learning VII