GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

Authors

  • Chenglong Wang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yongyu Mu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Hang Zhou School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Yifu Huo School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Ziming Zhu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Jiali Zeng Pattern Recognition Center, WeChat AI, Tencent Inc., China
  • Murun Yang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Bei Li School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Xiaoyang Hao School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Chunliang Zhang School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Fandong Meng Pattern Recognition Center, WeChat AI, Tencent Inc., China
  • Jingbo Zhu School of Computer Science and Engineering, Northeastern University, Shenyang, China
  • Tong Xiao School of Computer Science and Engineering, Northeastern University, Shenyang, China

DOI:

https://doi.org/10.1609/aaai.v40i39.40626

Abstract

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models remains fundamentally challenging: they rely heavily on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning, supporting downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several competitive discriminative and generative baselines.
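For readers unfamiliar with generative reward models, the sketch below illustrates the interface the abstract describes: given a prompt and two candidate responses, the model generates a free-text rationale followed by a discrete preference label, which a caller then parses. This is a minimal sketch under assumptions; the prompt template, the `generate` stub, and the `Preferred: A/B` label format are illustrative choices, not the authors' released GRAM-R² interface.

```python
import re

# Illustrative prompt template (an assumption, not the GRAM-R² release format):
# the model is asked to reason first, then emit a discrete preference label.
TEMPLATE = """You are a reward model. Compare the two responses to the prompt.
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
First explain your reasoning, then end with "Preferred: A" or "Preferred: B"."""


def generate(text: str) -> str:
    """Stub standing in for a call to the generative reward model.

    In practice this would be an LLM inference call; a canned output is
    returned here so the sketch is self-contained and runnable.
    """
    return ("Response A answers the question directly and correctly, "
            "while Response B is evasive.\nPreferred: A")


def judge(prompt: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Return (rationale, preference) parsed from the model's generation."""
    output = generate(TEMPLATE.format(prompt=prompt,
                                      response_a=response_a,
                                      response_b=response_b))
    match = re.search(r"Preferred:\s*([AB])", output)
    preference = match.group(1) if match else "tie"
    rationale = output[:match.start()].strip() if match else output.strip()
    return rationale, preference


if __name__ == "__main__":
    rationale, preference = judge("What is 2 + 2?", "4.", "It depends.")
    print(f"rationale: {rationale}\npreference: {preference}")
```

In a pipeline such as RLHF or response ranking, the parsed preference would drive the downstream update or ranking, while the rationale remains available for inspection; this separation of generated reasoning from the final label is the property the abstract emphasizes.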

Published

2026-03-14

How to Cite

Wang, C., Mu, Y., Zhou, H., Huo, Y., Zhu, Z., Zeng, J., … Xiao, T. (2026). GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33395–33403. https://doi.org/10.1609/aaai.v40i39.40626

Section

AAAI Technical Track on Natural Language Processing IV