When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

Authors

  • Yifan Xu, Hong Kong Baptist University
  • Xichen Ye, Fudan University
  • Yifan Chen, Hong Kong Baptist University
  • Qiaosheng Zhang, Shanghai Artificial Intelligence Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i44.41143

Abstract

The quality of datasets plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts data annotation; this issue calls for alignment algorithms with improved robustness against potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We dissect the underlying human intention model and the preference-flipping mechanism induced by external factors as two distinct stages; for the latter, we introduce an instance-dependent flipping probability built on the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference-flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and Direct Preference Optimization (DPO) algorithms. In our experiments, we evaluate the proposed method, together with baseline methods, under instance-dependent preference-flipping models in multiple settings.
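As a rough illustration of the two-stage view described in the abstract (the notation below is our own and not necessarily the paper's), an instance-dependent flip probability can be layered on top of the Bradley-Terry likelihood. Here \epsilon(x, y_w, y_l) is an assumed per-instance probability that the annotated preference between the chosen response y_w and the rejected response y_l is flipped, and r(x, y) is a reward function (e.g., the DPO implicit reward):

\[
P_{\mathrm{clean}}(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
\]
\[
P_{\mathrm{obs}}(y_w \succ y_l \mid x)
  = \bigl(1 - \epsilon(x, y_w, y_l)\bigr)\, P_{\mathrm{clean}}(y_w \succ y_l \mid x)
  + \epsilon(x, y_w, y_l)\, \bigl(1 - P_{\mathrm{clean}}(y_w \succ y_l \mid x)\bigr),
\]

where a DPO-style objective would minimize the negative log of P_{\mathrm{obs}} with the implicit reward r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}. Under this reading, \epsilon \equiv 0 recovers standard DPO, and an iterative scheme could alternate between estimating \epsilon from annotation-related features and updating the policy; the paper's actual parameterization and algorithm are given in the full text.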

Published

2026-03-14

How to Cite

Xu, Y., Ye, X., Chen, Y., & Zhang, Q. (2026). When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 38057–38065. https://doi.org/10.1609/aaai.v40i44.41143

Section

AAAI Special Track on AI Alignment