GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

Authors

  • Chenglong Wang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yongyu Mu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Hang Zhou School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yifu Huo School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Ziming Zhu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Jiali Zeng Pattern Recognition Center, WeChat AI, Tencent Inc., China
  • Murun Yang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Bei Li School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Xiaoyang Hao School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Chunliang Zhang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Fandong Meng Pattern Recognition Center, WeChat AI, Tencent Inc., China
  • Jingbo Zhu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Tong Xiao School of Computer Science and Engineering, Northeastern University, Shenyang, China

DOI:

https://doi.org/10.1609/aaai.v40i39.40626

Abstract

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models remains fundamentally challenging: they rely heavily on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning, supporting downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several competitive discriminative and generative baselines.
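For readers unfamiliar with generative reward models, the sketch below illustrates the interface the abstract describes: given a prompt and two candidate responses, the model generates a free-text rationale followed by a discrete preference label, which a caller then parses. This is a minimal sketch under assumptions; the prompt template, the `generate` stub, and the `Preferred: A/B` label format are illustrative choices, not the authors' released GRAM-R² interface.

```python
import re

# Illustrative prompt template (an assumption, not the GRAM-R² release format):
# the model is asked to reason first, then emit a discrete preference label.
TEMPLATE = """You are a reward model. Compare the two responses to the prompt.
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
First explain your reasoning, then end with "Preferred: A" or "Preferred: B"."""


def generate(text: str) -> str:
    """Stub standing in for a call to the generative reward model.

    In practice this would be an LLM inference call; a canned output is
    returned here so the sketch is self-contained and runnable.
    """
    return ("Response A answers the question directly and correctly, "
            "while Response B is evasive.\nPreferred: A")


def judge(prompt: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Return (rationale, preference) parsed from the model's generation."""
    output = generate(TEMPLATE.format(prompt=prompt,
                                      response_a=response_a,
                                      response_b=response_b))
    match = re.search(r"Preferred:\s*([AB])", output)
    preference = match.group(1) if match else "tie"
    rationale = output[:match.start()].strip() if match else output.strip()
    return rationale, preference


if __name__ == "__main__":
    rationale, preference = judge("What is 2 + 2?", "4.", "It depends.")
    print(f"rationale: {rationale}\npreference: {preference}")
```

In a pipeline such as RLHF or response ranking, the parsed preference would drive the downstream update or ranking, while the rationale remains available for inspection; this separation of generated reasoning from the final label is the property the abstract emphasizes.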

Published

2026-03-14

How to Cite

Wang, C., Mu, Y., Zhou, H., Huo, Y., Zhu, Z., Zeng, J., … Xiao, T. (2026). GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33395–33403. https://doi.org/10.1609/aaai.v40i39.40626

Section

AAAI Technical Track on Natural Language Processing IV