Conservative and Greedy Approaches to Classification-Based Policy Iteration

Authors

  • Mohammad Ghavamzadeh, INRIA Lille
  • Alessandro Lazaric, INRIA Lille

DOI:

https://doi.org/10.1609/aaai.v26i1.8304

Keywords:

Reinforcement Learning, Approximate Dynamic Programming, Classification-based Policy Iteration

Abstract

The existing classification-based policy iteration (CBPI) algorithms can be divided into two categories: direct policy iteration (DPI) methods that directly assign the output of the classifier (the approximate greedy policy w.r.t. the current policy) to the next policy, and conservative policy iteration (CPI) methods in which the new policy is a mixture distribution of the current policy and the output of the classifier. The conservative policy update gives CPI a desirable feature, namely the guarantee that the policies generated by this algorithm improve at each iteration. We provide a detailed algorithmic and theoretical comparison of these two classes of CBPI algorithms. Our results reveal that in order to achieve the same level of accuracy, CPI requires more iterations, and thus, more samples than the DPI algorithm. Furthermore, CPI may converge to suboptimal policies whose performance is not better than DPI's.
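The two policy-update rules contrasted in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: policies are represented as per-state action distributions (NumPy arrays), `greedy` stands in for the classifier's approximate greedy policy, and the mixing coefficient `alpha` is a hypothetical step size.

```python
import numpy as np

def dpi_update(policy, greedy_policy):
    """DPI (greedy) update: the next policy is the classifier's
    approximate greedy policy itself."""
    return greedy_policy.copy()

def cpi_update(policy, greedy_policy, alpha):
    """CPI (conservative) update: the next policy is the mixture
    (1 - alpha) * current + alpha * greedy, which underlies CPI's
    per-iteration improvement guarantee for small enough alpha."""
    return (1.0 - alpha) * policy + alpha * greedy_policy

# Toy setting: 3 states, 2 actions; each row is a state's action distribution.
pi = np.full((3, 2), 0.5)           # current (uniform) stochastic policy
greedy = np.array([[1.0, 0.0],      # classifier output: deterministic
                   [0.0, 1.0],      # greedy action per state
                   [1.0, 0.0]])

print(dpi_update(pi, greedy))       # jumps straight to the greedy policy
print(cpi_update(pi, greedy, 0.2))  # takes a small step toward it
```

The sketch makes the trade-off concrete: DPI moves to the greedy policy in one step, while CPI's small mixing steps are what make it need more iterations (and samples) to reach the same accuracy.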

Published

2021-09-20

How to Cite

Ghavamzadeh, M., & Lazaric, A. (2021). Conservative and Greedy Approaches to Classification-Based Policy Iteration. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 914-920. https://doi.org/10.1609/aaai.v26i1.8304

Section

AAAI Technical Track: Machine Learning