Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning
DOI:
https://doi.org/10.1609/aaai.v40i37.40441
Abstract
Large reasoning models (LRMs) improve performance at test time by thinking longer, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt outcome-level rewards, such as rule- or prompt-based signals, that favor shorter correct reasoning paths but overlook reasoning quality. Because such rewards neglect intermediate reasoning, dense supervision from process reward models (PRMs) has proven more effective at promoting coherent, high-quality reasoning. However, static PRM supervision introduces two challenges: reward hacking, since fixed rewards poorly capture global reasoning objectives, and the high training cost of obtaining dense reward labels at scale. To overcome these issues, we propose Step Group Relative Policy Optimization (Step-GRPO), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while improving reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train large language models and observe consistent gains in reasoning quality, accuracy, and reasoning-trace brevity across multiple math benchmarks, outperforming reinforcement learning baselines at substantially lower cost. Notably, the proposed model achieves 36.7 percent accuracy on AIME 2024 with 11,000 training samples and a training cost of 38 US dollars, surpassing baselines that require over 1,000 US dollars and more than 40,000 samples, demonstrating strong cost-effectiveness and scalability.
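The abstract's core mechanism can be illustrated with a minimal sketch: per-step PRM scores are aggregated by an attention-style weighting into a single trajectory-level reward, which GRPO then normalizes within each sampled group. The softmax attention form, the mixing weight `alpha`, and the function names are assumptions for illustration, not the authors' exact formulation.

```python
import math

def step_grpo_reward(step_scores, outcome_correct, alpha=0.5):
    """Hypothetical Step-GRPO trajectory reward (assumed form).

    step_scores: per-step PRM scores in [0, 1] for one reasoning trace.
    outcome_correct: 1.0 if the final answer is correct, else 0.0.
    alpha: assumed mixing weight between process and outcome signals.
    """
    # Step-attention (assumed softmax form): weight each step's PRM score
    # so that critical, high-scoring steps dominate the aggregate signal.
    m = max(step_scores)
    w = [math.exp(s - m) for s in step_scores]
    z = sum(w)
    process_reward = sum((wi / z) * si for wi, si in zip(w, step_scores))
    # Collapse dense PRM signals into one sparse trajectory-level reward,
    # mixed with the rule-based outcome reward.
    return alpha * process_reward + (1 - alpha) * outcome_correct

def group_relative_advantages(rewards):
    """GRPO-style advantage: each trajectory's reward normalized
    against the mean and standard deviation of its sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = math.sqrt(var) + 1e-8
    return [(r - mu) / sd for r in rewards]
```

Note that supervision stays sparse: the PRM is consulted once per trajectory to produce a single scalar, so no per-step reward labels are needed during RL training.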
Published
2026-03-14
How to Cite
Li, W., Wang, J., Yu, L.-C., & Zhang, X. (2026). Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31734-31742. https://doi.org/10.1609/aaai.v40i37.40441
Section
AAAI Technical Track on Natural Language Processing II