Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions

Authors

  • Canzhe Zhao, John Hopcroft Center for Computer Science, Shanghai Jiao Tong University
  • Xiangcheng Zhang, Weiyang College, Tsinghua University
  • Baoxiang Wang, School of Data Science, The Chinese University of Hong Kong, Shenzhen
  • Shuai Li, John Hopcroft Center for Computer Science, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i21.34436

Abstract

In this work, we study the logarithmic regret for reinforcement learning (RL) with linear function approximation and adversarial corruptions, in the formulation of linear Markov decision processes (MDPs). Specifically, we consider the case where the reward functions are adversarially corrupted, and the total amount of corruption at each step h across all K episodes is bounded by a corruption level C ≥ 0. We propose an algorithm, double-weighted least-squares value iteration with UCB (DW-LSVI-UCB), which leverages weighted linear regressions to simultaneously learn the (corrupted) unknown reward parameters and the unknown transition parameters. We prove that DW-LSVI-UCB attains an O(d²H⁴ log²(1 + K/δ) / gap_min + CdH²) regret (omitting lower-order terms), where d is the ambient dimension of the feature mapping, H is the horizon length, gap_min is the minimal sub-optimality gap, and K is the number of episodes. Additionally, when the reward functions are uncorrupted, the regret of our algorithm improves upon the previous best result by an O(dH/log K) factor.
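The weighted linear regressions at the core of the algorithm can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: it solves a generic weighted ridge regression of the form min_θ Σ_k w_k (φ_kᵀθ − y_k)² + λ‖θ‖², the building block that LSVI-style methods use to estimate reward and transition parameters from feature-target pairs; the function name, weighting scheme, and regularizer λ are all hypothetical.

```python
import numpy as np

def weighted_ridge(features, targets, weights, lam=1.0):
    """Weighted ridge regression (hypothetical helper, for illustration).

    Solves min_theta sum_k w_k * (features[k] @ theta - targets[k])**2
                     + lam * ||theta||**2
    via the closed form theta = (Phi^T W Phi + lam I)^{-1} Phi^T W y.
    """
    d = features.shape[1]
    # Weighted Gram matrix with ridge regularization.
    A = features.T @ (weights[:, None] * features) + lam * np.eye(d)
    # Weighted feature-target correlation vector.
    b = features.T @ (weights * targets)
    return np.linalg.solve(A, b)
```

In corruption-robust variants, the weights w_k would down-weight episodes suspected of being corrupted, so that a bounded total corruption C inflates the estimation error only by an additive term rather than breaking the estimator.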

Published

2025-04-11

How to Cite

Zhao, C., Zhang, X., Wang, B., & Li, S. (2025). Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22759–22767. https://doi.org/10.1609/aaai.v39i21.34436

Section

AAAI Technical Track on Machine Learning VII