URPO: A Unified Reward & Policy Optimization Framework for Large Language Models

Authors

  • Songshuo Lu, Moore Threads AI
  • Hua Wang, Moore Threads AI
  • Zhi Chen, Moore Threads AI
  • Yaohua Tang, Moore Threads AI

DOI:

https://doi.org/10.1609/aaai.v40i38.40507

Abstract

Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and leads to a performance ceiling. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) into a single model and a single training phase. Our method recasts all alignment data (preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate that URPO significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval to 44.84 and achieving a 36% relative improvement on the challenging AIME reasoning benchmark. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
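
The abstract describes one GRPO loop that draws rewards from two kinds of sources: ground-truth checks (preference labels, verifiable answers) and scores the model assigns to its own open-ended responses. The sketch below illustrates only the group-relative step that GRPO is named for, in which each sampled response is scored against the mean and spread of its own group. It is not the authors' implementation: the function names, the exact-match reward, the use of the population standard deviation, the `eps` constant, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, gold: str) -> float:
    """Hypothetical ground-truth reward: 1.0 on an exact match, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    eps is an assumed constant that avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# "Referee" data: four sampled judgments on one preference pair
# whose ground-truth label is assumed to be "A".
judgments = ["A", "B", "A", "A"]
rewards = [verifiable_reward(j, "A") for j in judgments]
advantages = group_relative_advantages(rewards)

# Open-ended data: made-up scores standing in for the ratings the
# model would assign to its own four responses.
self_scores = [7.0, 4.0, 9.0, 4.0]
self_advantages = group_relative_advantages(self_scores)
```

In this sketch both reward sources pass through the same normalization, which is one way to read the abstract's "single GRPO loop": the incorrect judgment receives a negative advantage, and the highest self-scored response receives the largest positive one.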

Published

2026-03-14

How to Cite

Lu, S., Wang, H., Chen, Z., & Tang, Y. (2026). URPO: A Unified Reward & Policy Optimization Framework for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32329–32337. https://doi.org/10.1609/aaai.v40i38.40507

Section

AAAI Technical Track on Natural Language Processing III