Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models
DOI:
https://doi.org/10.1609/aaai.v40i39.40627
Abstract
Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not report performance on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct the Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions, designed to favor reward models that better capture preferences across those dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench correlates strongly with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models. By analyzing the evaluation results on MRMBench, we reveal that reward models struggle to capture preferences across multiple dimensions simultaneously, highlighting the potential of multi-objective optimization in reward modeling. Finally, our results demonstrate that the proposed inference-time probing method provides a reliable metric for assessing the confidence of reward predictions, leading to improved alignment of large language models.
Published
2026-03-14
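The probing idea summarized in the abstract can be illustrated concretely: train a lightweight linear classifier on a reward model's internal representations and test whether a given preference dimension is decodable from them. The sketch below is a minimal, self-contained illustration of such a linear probe; the synthetic vectors stand in for real reward-model activations, and the setup (sizes, labels, training scheme) is illustrative rather than MRMBench's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each response has a hidden representation from a
# reward model (random vectors here stand in for real activations) and a
# binary label for one preference dimension (e.g., helpfulness).
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic labels for illustration

# Linear probe: logistic regression fit by gradient descent on the first
# 100 examples. High held-out accuracy indicates the preference
# dimension is linearly decodable from the representation.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X[:100] @ w)))   # predicted probabilities
    w -= 0.5 * X[:100].T @ (p - y[:100]) / 100  # averaged gradient step

# Evaluate the probe on the held-out half.
acc = ((X[100:] @ w > 0).astype(float) == y[100:]).mean()
print(f"probe accuracy: {acc:.2f}")
```

In this framing, one probe is trained per preference dimension, and the per-dimension accuracies give exactly the kind of dimension-level performance breakdown that a single pairwise-ranking score cannot.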
How to Cite
Wang, C., Huo, Y., Gan, Y., Mu, Y., He, Q., Yang, M., … Xiao, T. (2026). Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33404–33412. https://doi.org/10.1609/aaai.v40i39.40627
Issue
Section
AAAI Technical Track on Natural Language Processing IV