Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions

Canzhe Zhao; Xiangcheng Zhang; Baoxiang Wang; Shuai Li

doi:10.1609/aaai.v39i21.34436

Authors

Canzhe Zhao John Hopcroft Center for Computer Science, Shanghai Jiaotong University
Xiangcheng Zhang Weiyang College, Tsinghua University
Baoxiang Wang School of Data Science, The Chinese University of Hong Kong, Shenzhen
Shuai Li John Hopcroft Center for Computer Science, Shanghai Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v39i21.34436

Abstract

In this work, we study the logarithmic regret for reinforcement learning (RL) with linear function approximation and adversarial corruptions, in the formulation of linear Markov decision processes (MDPs). Specifically, we consider the case where there exist adversarial corruptions over the reward functions, and the total amount of the corruptions of each step h across all episodes K is bounded by a corruption level C ≥ 0. We propose an algorithm, double-weighted least-squares value iteration with UCB (DW-LSVI-UCB), which leverages weighted linear regressions to learn the (corrupted) unknown reward parameters and unknown transition parameters simultaneously. We prove that DW-LSVI-UCB attains an O( d2H4 log2(1+K/δ) gapmin + CdH2) regret (omitting the dependence on lower order terms), where d is the ambient dimension of the feature mapping, H is the horizon length, gapmin is the minimal sub-optimality gap, and K is the number of episodes. Additionally, when there are no adversarial corruptions over reward functions, the regret of our algorithm improves the previous best result by an O(dH/ log K) factor.

Logarithmic Regret for Linear Markov Decision Processes with Adversarial Corruptions

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information