Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

Authors

  • Chenglong Wang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yifu Huo School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yang Gan School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yongyu Mu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Qiaozhi He School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Murun Yang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Bei Li Meituan Inc.
  • Chunliang Zhang School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
  • Tongran Liu CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China
  • Anxiang Ma School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Zhengtao Yu Kunming University of Science and Technology
  • Jingbo Zhu School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
  • Tong Xiao School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China

DOI:

https://doi.org/10.1609/aaai.v40i39.40627

Abstract

Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but such evaluations typically provide no performance information for individual preference dimensions. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions. We design it to favor and encourage reward models that better capture preferences across these dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions a reward model relies on during reward prediction, enhancing its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models. By analyzing the evaluation results on MRMBench, we reveal that reward models struggle to capture preferences across multiple dimensions simultaneously, highlighting the potential of multi-objective optimization in reward modeling. Finally, our results demonstrate that the proposed inference-time probing method provides a reliable metric for assessing the confidence of reward predictions, leading to improved alignment of large language models.

Published

2026-03-14

How to Cite

Wang, C., Huo, Y., Gan, Y., Mu, Y., He, Q., Yang, M., … Xiao, T. (2026). Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33404–33412. https://doi.org/10.1609/aaai.v40i39.40627

Section

AAAI Technical Track on Natural Language Processing IV