Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning
DOI:
https://doi.org/10.1609/aaai.v40i37.40441
Abstract
Large reasoning models (LRMs) improve performance at test time by thinking longer, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt outcome-level rewards, such as rule- or prompt-based signals, that favor shorter correct reasoning paths but overlook reasoning quality. Because such rewards neglect intermediate reasoning, dense supervision from process reward models (PRMs) has proven more effective at promoting coherent, high-quality reasoning. However, static PRM supervision introduces two challenges: reward hacking, since fixed rewards poorly capture global reasoning objectives, and the high training cost of obtaining dense reward labels at scale. To overcome these issues, we propose Step Group Relative Policy Optimization (Step-GRPO), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while improving reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train large language models and observe consistent gains in reasoning quality, accuracy, and reasoning-trace brevity across multiple math benchmarks, outperforming reinforcement learning baselines at substantially lower cost. Notably, the proposed model achieves 36.7 percent accuracy on AIME 2024 with 11,000 training samples and a training cost of 38 US dollars, surpassing baselines that require over 1,000 US dollars and more than 40,000 samples, demonstrating strong cost-effectiveness and scalability.
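The abstract's core mechanism can be illustrated with a minimal sketch: per-step PRM scores are aggregated by an attention-style weighting into a single trajectory-level reward, which GRPO then normalizes within each sampled group. The softmax attention form, the mixing weight `alpha`, and the function names are assumptions for illustration, not the authors' exact formulation.

```python
import math

def step_grpo_reward(step_scores, outcome_correct, alpha=0.5):
    """Hypothetical Step-GRPO trajectory reward (assumed form).

    step_scores: per-step PRM scores in [0, 1] for one reasoning trace.
    outcome_correct: 1.0 if the final answer is correct, else 0.0.
    alpha: assumed mixing weight between process and outcome signals.
    """
    # Step-attention (assumed softmax form): weight each step's PRM score
    # so that critical, high-scoring steps dominate the aggregate signal.
    m = max(step_scores)
    w = [math.exp(s - m) for s in step_scores]
    z = sum(w)
    process_reward = sum((wi / z) * si for wi, si in zip(w, step_scores))
    # Collapse dense PRM signals into one sparse trajectory-level reward,
    # mixed with the rule-based outcome reward.
    return alpha * process_reward + (1 - alpha) * outcome_correct

def group_relative_advantages(rewards):
    """GRPO-style advantage: each trajectory's reward normalized
    against the mean and standard deviation of its sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = math.sqrt(var) + 1e-8
    return [(r - mu) / sd for r in rewards]
```

Note that supervision stays sparse: the PRM is consulted once per trajectory to produce a single scalar, so no per-step reward labels are needed during RL training.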
Published
2026-03-14
How to Cite
Li, W., Wang, J., Yu, L.-C., & Zhang, X. (2026). Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31734-31742. https://doi.org/10.1609/aaai.v40i37.40441
Section
AAAI Technical Track on Natural Language Processing II