DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback

Authors

  • Xuening Feng, Shanghai Jiao Tong University
  • Zhaohui Jiang, Shanghai Jiao Tong University
  • Timo Kaufmann, Ludwig-Maximilians-Universität München; Munich Center of Machine Learning
  • Puchen Xu, Shanghai Jiao Tong University
  • Eyke Hüllermeier, Ludwig-Maximilians-Universität München; Munich Center of Machine Learning; German Research Center for Artificial Intelligence
  • Paul Weng, Duke Kunshan University
  • Yifei Zhu, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i16.33824

Abstract

Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) has emerged as a promising approach to circumvent this difficulty. In RLHF, the agent typically learns a reward function by querying a human teacher with pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries needed to learn an informative reward function, since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, on-policy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.
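The query-selection idea in the abstract — score on-policy candidate pairs by epistemic uncertainty, then filter for diversity — can be sketched in a few lines. This is an illustrative toy, not the paper's actual algorithm: the reward "ensemble" here is a set of random linear scorers, uncertainty is proxied by ensemble prediction variance, and the paper's clustering-based filter is replaced by a simpler greedy farthest-point selection. All names and parameters below (`disagreement`, `select_queries`, `pool_factor`) are hypothetical.

```python
import random
import statistics

def disagreement(feats, ensemble):
    """Epistemic-uncertainty proxy: variance of predictions across a reward ensemble."""
    preds = [sum(w * f for w, f in zip(weights, feats)) for weights in ensemble]
    return statistics.pvariance(preds)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_queries(candidates, ensemble, n_queries, pool_factor=3):
    """Keep the most uncertain candidates, then pick a diverse subset of them."""
    # 1) rank candidate query features by ensemble disagreement (most uncertain first)
    ranked = sorted(candidates, key=lambda c: disagreement(c, ensemble), reverse=True)
    pool = ranked[: pool_factor * n_queries]
    # 2) greedy farthest-point selection as a stand-in for the clustering-based filter
    selected = [pool[0]]  # start from the single most uncertain candidate
    while len(selected) < min(n_queries, len(pool)):
        best = max((c for c in pool if c not in selected),
                   key=lambda c: min(euclidean(c, s) for s in selected))
        selected.append(best)
    return selected

# Toy data: 5 random linear reward models, 50 candidate feature vectors
random.seed(0)
ensemble = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
candidates = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
queries = select_queries(candidates, ensemble, n_queries=5)
```

In the actual method, candidates would be pairs of trajectory segments generated on-policy, and the selected queries would be sent to the human teacher for pairwise comparison.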

Published

2025-04-11

How to Cite

Feng, X., Jiang, Z., Kaufmann, T., Xu, P., Hüllermeier, E., Weng, P., & Zhu, Y. (2025). DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback. Proceedings of the AAAI Conference on Artificial Intelligence, 39(16), 16604–16612. https://doi.org/10.1609/aaai.v39i16.33824

Section

AAAI Technical Track on Machine Learning II