Aligning Agent Policies with Preferences: Human-Centered Interpretable Reinforcement Learning
DOI:
https://doi.org/10.1609/aies.v8i2.36668

Abstract
An unaddressed challenge in interpretable reinforcement learning (RL) is to enable AI agents to integrate preference feedback into the policy generation process. Existing methods collect feedback only after training is complete, neglecting opportunities to inform the learning process. To address this gap, we propose a novel framework to align interpretable policies with human feedback during training. Our framework interleaves preference learning with an evolutionary algorithm, using updated preference estimates to guide the generation of better-aligned policies, and using newly-generated policies to query users to refine the preference model. Evolutionary algorithms enable the exploration of the full space of policies; however, it is intractable to maintain separate preference estimates---like win rates or utility values---for each individual policy in this infinite space. To handle this challenge, we propose to represent policies as feature vectors consisting of a finite set of meaningful attributes. For example, among a set of policies with similar performance, some may be more intuitive or more amenable to human intervention. To maximize the value of each user query, we employ a novel filtering technique to avoid presenting policies that are dominated in all dimensions, as repeated selections of clearly superior policies provide little information. We validate our method with experiments on synthetic preference data on two RL environments. We show that it produces RL policies that are not only better-aligned with user preferences but also more efficient in the number of user queries.
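The abstract's filtering step removes policies that are dominated in every attribute before querying the user. The paper's exact procedure is not given here; the following is a minimal sketch of standard Pareto non-dominated filtering over policy feature vectors, assuming higher attribute values are preferred. The function name `pareto_filter` and the two-attribute example (performance, interpretability) are illustrative, not from the paper.

```python
import numpy as np

def pareto_filter(feature_vectors):
    """Return indices of policies that are not Pareto-dominated.

    A policy is dominated if some other policy is at least as good in
    every attribute and strictly better in at least one (higher = better).
    """
    X = np.asarray(feature_vectors, dtype=float)
    keep = []
    for i, x in enumerate(X):
        dominated = any(
            np.all(y >= x) and np.any(y > x)
            for j, y in enumerate(X) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical policies described by (performance, interpretability):
policies = [(0.9, 0.2), (0.5, 0.8), (0.7, 0.7), (0.4, 0.4)]
print(pareto_filter(policies))  # [0, 1, 2]: (0.4, 0.4) is dominated by (0.7, 0.7)
```

Presenting only the surviving (non-dominated) policies to the user makes each query informative: no option is an obviously superior choice along every dimension.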
Published
2025-10-15
How to Cite
Milani, S., Zhang, Z., Topin, N., Xia, L., & Fang, F. (2025). Aligning Agent Policies with Preferences: Human-Centered Interpretable Reinforcement Learning. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(2), 1711-1723. https://doi.org/10.1609/aies.v8i2.36668