Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models
DOI:
https://doi.org/10.1609/aaai.v40i39.40627
Abstract
Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not report performance on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct the Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions, designed to favor reward models that better capture preferences across those dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench correlates strongly with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models. By analyzing the evaluation results on MRMBench, we reveal that reward models struggle to capture preferences across multiple dimensions simultaneously, highlighting the potential of multi-objective optimization in reward modeling. Finally, our results demonstrate that the proposed inference-time probing method provides a reliable metric for assessing the confidence of reward predictions, leading to improved alignment of large language models.
Published
2026-03-14
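The probing idea summarized in the abstract can be illustrated concretely: train a lightweight linear classifier on a reward model's internal representations and test whether a given preference dimension is decodable from them. The sketch below is a minimal, self-contained illustration of such a linear probe; the synthetic vectors stand in for real reward-model activations, and the setup (sizes, labels, training scheme) is illustrative rather than MRMBench's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each response has a hidden representation from a
# reward model (random vectors here stand in for real activations) and a
# binary label for one preference dimension (e.g., helpfulness).
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic labels for illustration

# Linear probe: logistic regression fit by gradient descent on the first
# 100 examples. High held-out accuracy indicates the preference
# dimension is linearly decodable from the representation.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X[:100] @ w)))   # predicted probabilities
    w -= 0.5 * X[:100].T @ (p - y[:100]) / 100  # averaged gradient step

# Evaluate the probe on the held-out half.
acc = ((X[100:] @ w > 0).astype(float) == y[100:]).mean()
print(f"probe accuracy: {acc:.2f}")
```

In this framing, one probe is trained per preference dimension, and the per-dimension accuracies give exactly the kind of dimension-level performance breakdown that a single pairwise-ranking score cannot.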
How to Cite
Wang, C., Huo, Y., Gan, Y., Mu, Y., He, Q., Yang, M., … Xiao, T. (2026). Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33404–33412. https://doi.org/10.1609/aaai.v40i39.40627
Issue
Section
AAAI Technical Track on Natural Language Processing IV