Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning

Authors

  • Shihan Dou School of Computer Science, Fudan University, Shanghai, China
  • Yan Liu School of Computer Science, Fudan University, Shanghai, China
  • Enyu Zhou School of Computer Science, Fudan University, Shanghai, China
  • Songyang Gao School of Computer Science, Fudan University, Shanghai, China
  • Tianlong Li School of Computer Science, Fudan University, Shanghai, China
  • Limao Xiong School of Computer Science, Fudan University, Shanghai, China
  • Xin Zhao Ant Group, Shanghai, China
  • Haoxiang Jia School of Computer Science, Peking University, Beijing, China
  • Junjie Ye School of Computer Science, Fudan University, Shanghai, China
  • Rui Zheng School of Computer Science, Fudan University, Shanghai, China
  • Tao Gui Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China
  • Qi Zhang School of Computer Science, Fudan University, Shanghai, China Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
  • Xuanjing Huang School of Computer Science, Fudan University, Shanghai, China Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai, China

DOI:

https://doi.org/10.1609/aaai.v39i22.34552

Abstract

The capability of the reward model (RM) is crucial to the success of Reinforcement Learning from Human Feedback (RLHF) in aligning with human preferences. However, as training progresses, the output distribution of the policy model shifts. The RM, initially trained on responses sampled from the early policy model's output distribution, gradually loses its ability to distinguish between responses drawn from the newly shifted distribution. This issue is further compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside that distribution. These two issues can be unified as a single challenge: the shifted distribution of the environment. To surmount this challenge, we introduce MetaRM, a novel method that leverages meta-learning to adapt the RM to the shifted environment distribution. MetaRM optimizes the RM in an alternating fashion, both preserving the preferences of the original preference pairs and maximizing its discriminative power over new examples from the shifted distribution. Extensive experiments demonstrate that MetaRM iteratively enhances human preference alignment by improving the RM's capacity to identify subtle differences in samples from shifted distributions.
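The alternating optimization the abstract describes can be illustrated with a minimal first-order meta-learning sketch. This is not the paper's implementation: the linear reward model, the Bradley-Terry preference loss, and all function names (`metarm_step`, `pref_grad`) are illustrative assumptions. The sketch adapts the reward weights on a shifted-distribution pair (inner step), then corrects with the original-pair gradient evaluated at the adapted parameters (outer step), so the model gains discrimination on the new distribution without forgetting the original preferences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pref_grad(w, chosen, rejected):
    """Gradient w.r.t. w of the Bradley-Terry preference loss
    -log sigmoid(w . (chosen - rejected))."""
    d = [c - r for c, r in zip(chosen, rejected)]
    p = sigmoid(dot(d, w))
    return [-(1.0 - p) * di for di in d]

def metarm_step(w, orig_pair, shifted_pair, inner_lr=0.1, outer_lr=0.1):
    """One alternating update (first-order meta-learning sketch, not the
    paper's exact algorithm): adapt on a shifted-distribution pair, then
    apply the original-pair gradient computed at the adapted point."""
    g_shift = pref_grad(w, *shifted_pair)
    # Inner step: a hypothetical adaptation toward the shifted distribution.
    w_adapted = [wi - inner_lr * gi for wi, gi in zip(w, g_shift)]
    # Outer correction: original-preference gradient at the adapted weights.
    g_orig = pref_grad(w_adapted, *orig_pair)
    # Move against both gradients so the RM keeps the original preferences
    # while improving discrimination on the shifted distribution.
    return [wi - outer_lr * (gs + go) for wi, gs, go in zip(w, g_shift, g_orig)]
```

Running a few such steps on toy feature vectors increases the reward margin (chosen minus rejected) on both the original and the shifted pairs, which is the qualitative behavior the abstract attributes to MetaRM.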

Published

2025-04-11

How to Cite

Dou, S., Liu, Y., Zhou, E., Gao, S., Li, T., Xiong, L., … Huang, X. (2025). Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23805–23813. https://doi.org/10.1609/aaai.v39i22.34552

Section

AAAI Technical Track on Natural Language Processing I