Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning

Authors

  • Shihan Dou School of Computer Science, Fudan University, Shanghai, China
  • Yan Liu School of Computer Science, Fudan University, Shanghai, China
  • Enyu Zhou School of Computer Science, Fudan University, Shanghai, China
  • Songyang Gao School of Computer Science, Fudan University, Shanghai, China
  • Tianlong Li School of Computer Science, Fudan University, Shanghai, China
  • Limao Xiong School of Computer Science, Fudan University, Shanghai, China
  • Xin Zhao Ant Group, Shanghai, China
  • Haoxiang Jia School of Computer Science, Peking University, Beijing, China
  • Junjie Ye School of Computer Science, Fudan University, Shanghai, China
  • Rui Zheng School of Computer Science, Fudan University, Shanghai, China
  • Tao Gui Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China
  • Qi Zhang School of Computer Science, Fudan University, Shanghai, China Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
  • Xuanjing Huang School of Computer Science, Fudan University, Shanghai, China Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai, China

DOI:

https://doi.org/10.1609/aaai.v39i22.34552

Abstract

The capability of the reward model (RM) is crucial to the success of Reinforcement Learning from Human Feedback (RLHF) in aligning with human preferences. However, as training progresses, the output distribution of the policy model shifts. The RM, initially trained on responses sampled from the early policy model's output distribution, gradually loses its ability to distinguish between responses drawn from the newly shifted distribution. This issue is further compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside that distribution. These two issues can be unified as a single challenge: the shifted distribution of the environment. To surmount this challenge, we introduce MetaRM, a novel method that leverages meta-learning to adapt the RM to the shifted environment distribution. MetaRM optimizes the RM in an alternating fashion, both preserving the preferences of the original preference pairs and maximizing its discriminative power over new examples from the shifted distribution. Extensive experiments demonstrate that MetaRM iteratively enhances human preference alignment by improving the RM's capacity to identify subtle differences in samples from shifted distributions.
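The alternating optimization the abstract describes can be illustrated with a minimal first-order meta-learning sketch. This is not the paper's implementation: the linear reward model, the Bradley-Terry preference loss, and all function names (`metarm_step`, `pref_grad`) are illustrative assumptions. The sketch adapts the reward weights on a shifted-distribution pair (inner step), then corrects with the original-pair gradient evaluated at the adapted parameters (outer step), so the model gains discrimination on the new distribution without forgetting the original preferences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pref_grad(w, chosen, rejected):
    """Gradient w.r.t. w of the Bradley-Terry preference loss
    -log sigmoid(w . (chosen - rejected))."""
    d = [c - r for c, r in zip(chosen, rejected)]
    p = sigmoid(dot(d, w))
    return [-(1.0 - p) * di for di in d]

def metarm_step(w, orig_pair, shifted_pair, inner_lr=0.1, outer_lr=0.1):
    """One alternating update (first-order meta-learning sketch, not the
    paper's exact algorithm): adapt on a shifted-distribution pair, then
    apply the original-pair gradient computed at the adapted point."""
    g_shift = pref_grad(w, *shifted_pair)
    # Inner step: a hypothetical adaptation toward the shifted distribution.
    w_adapted = [wi - inner_lr * gi for wi, gi in zip(w, g_shift)]
    # Outer correction: original-preference gradient at the adapted weights.
    g_orig = pref_grad(w_adapted, *orig_pair)
    # Move against both gradients so the RM keeps the original preferences
    # while improving discrimination on the shifted distribution.
    return [wi - outer_lr * (gs + go) for wi, gs, go in zip(w, g_shift, g_orig)]
```

Running a few such steps on toy feature vectors increases the reward margin (chosen minus rejected) on both the original and the shifted pairs, which is the qualitative behavior the abstract attributes to MetaRM.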

Published

2025-04-11

How to Cite

Dou, S., Liu, Y., Zhou, E., Gao, S., Li, T., Xiong, L., … Huang, X. (2025). Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23805–23813. https://doi.org/10.1609/aaai.v39i22.34552

Section

AAAI Technical Track on Natural Language Processing I