Aligning Agent Policies with Preferences: Human-Centered Interpretable Reinforcement Learning
DOI:
https://doi.org/10.1609/aies.v8i2.36668

Abstract
An unaddressed challenge in interpretable reinforcement learning (RL) is to enable AI agents to integrate preference feedback into the policy generation process. Existing methods collect feedback only after training is complete, neglecting opportunities to inform the learning process. To address this gap, we propose a novel framework to align interpretable policies with human feedback during training. Our framework interleaves preference learning with an evolutionary algorithm, using updated preference estimates to guide the generation of better-aligned policies, and using newly-generated policies to query users to refine the preference model. Evolutionary algorithms enable the exploration of the full space of policies; however, it is intractable to maintain separate preference estimates---like win rates or utility values---for each individual policy in this infinite space. To handle this challenge, we propose to represent policies as feature vectors consisting of a finite set of meaningful attributes. For example, among a set of policies with similar performance, some may be more intuitive or more amenable to human intervention. To maximize the value of each user query, we employ a novel filtering technique to avoid presenting policies that are dominated in all dimensions, as repeated selections of clearly superior policies provide little information. We validate our method with experiments on synthetic preference data on two RL environments. We show that it produces RL policies that are not only better-aligned with user preferences but also more efficient in the number of user queries.
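The abstract's filtering step removes policies that are dominated in every attribute before querying the user. The paper's exact procedure is not given here; the following is a minimal sketch of standard Pareto non-dominated filtering over policy feature vectors, assuming higher attribute values are preferred. The function name `pareto_filter` and the two-attribute example (performance, interpretability) are illustrative, not from the paper.

```python
import numpy as np

def pareto_filter(feature_vectors):
    """Return indices of policies that are not Pareto-dominated.

    A policy is dominated if some other policy is at least as good in
    every attribute and strictly better in at least one (higher = better).
    """
    X = np.asarray(feature_vectors, dtype=float)
    keep = []
    for i, x in enumerate(X):
        dominated = any(
            np.all(y >= x) and np.any(y > x)
            for j, y in enumerate(X) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical policies described by (performance, interpretability):
policies = [(0.9, 0.2), (0.5, 0.8), (0.7, 0.7), (0.4, 0.4)]
print(pareto_filter(policies))  # [0, 1, 2]: (0.4, 0.4) is dominated by (0.7, 0.7)
```

Presenting only the surviving (non-dominated) policies to the user makes each query informative: no option is an obviously superior choice along every dimension.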
Published
2025-10-15
How to Cite
Milani, S., Zhang, Z., Topin, N., Xia, L., & Fang, F. (2025). Aligning Agent Policies with Preferences: Human-Centered Interpretable Reinforcement Learning. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(2), 1711-1723. https://doi.org/10.1609/aies.v8i2.36668