TY  - JOUR
AU  - Xu, Zhe
AU  - Gavran, Ivan
AU  - Ahmad, Yousef
AU  - Majumdar, Rupak
AU  - Neider, Daniel
AU  - Topcu, Ufuk
AU  - Wu, Bo
PY  - 2020/06/01
Y2  - 2024/04/19
TI  - Joint Inference of Reward Machines and Policies for Reinforcement Learning
JF  - Proceedings of the International Conference on Automated Planning and Scheduling
JA  - ICAPS
VL  - 30
IS  - 1
SE  - Planning and Learning
DO  - 10.1609/icaps.v30i1.6756
UR  - https://ojs.aaai.org/index.php/ICAPS/article/view/6756
SP  - 590
EP  - 598
AB  - Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge is in the form of reward machines, a type of Mealy machine that encodes non-Markovian reward functions. We focus on a setting in which this knowledge is a priori not available to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It uses a separate q-function defined for each state of the current hypothesis reward machine to determine the policy and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards based on the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation between states of reward machines, we transfer the q-functions between the hypothesis reward machines in consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit. The experiments show that learning high-level knowledge in the form of reward machines leads to fast convergence to optimal policies in RL, while the baseline RL methods fail to converge to optimal policies after a substantial number of training steps.
ER  -