Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Deyang Kong; Qi Guo; Xiangyu Xi; Wei Wang; Jingang Wang; Xunliang Cai; Shikun Zhang; Wei Ye

doi:10.1609/aaai.v40i37.40408

Authors

Deyang Kong: National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Meituan Group, Beijing, China
Qi Guo: National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Meituan Group, Beijing, China
Xiangyu Xi: Meituan Group, Beijing, China
Wei Wang: Meituan Group, Beijing, China
Jingang Wang: Meituan Group, Beijing, China
Xunliang Cai: Meituan Group, Beijing, China
Shikun Zhang: National Engineering Research Center for Software Engineering, Peking University, Beijing, China
Wei Ye: National Engineering Research Center for Software Engineering, Peking University, Beijing, China

Abstract

Low sampling efficiency during the rollout phase is a significant obstacle to scaling reinforcement learning (RL) for large language model (LLM) reasoning. Existing methods try to improve efficiency by scheduling problems according to their difficulty. However, these approaches suffer from unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To address these challenges, we introduce Competence-Difficulty Alignment Sampling (CDAS). CDAS estimates problem difficulty accurately and stably by aggregating historical performance discrepancies across problems. It then quantifies model competence and uses a fixed-point system to adaptively select problems whose difficulty aligns with the model's current competence. Extensive experiments in mathematical RL training show that CDAS consistently outperforms strong baselines, achieving the highest average accuracy of 45.89%. Furthermore, CDAS reduces the training step time overhead by 57.06% compared with the widely used Dynamic Sampling strategy, confirming its efficiency. Additional experiments on different tasks, model architectures, and model sizes demonstrate that CDAS generalizes.
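The sampling idea described in the abstract can be illustrated with a small sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the class name `AlignmentSampler`, the sigmoid model of expected pass rate, the running mean used to aggregate discrepancies, and the choice of competence as the negative mean difficulty. It should be read as one possible instance of competence-difficulty alignment, not as the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic function, used here as an assumed pass-rate model."""
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Hypothetical sketch of competence-difficulty aligned sampling."""

    def __init__(self, problem_ids):
        # Every problem starts with neutral difficulty and no history.
        self.difficulty = {pid: 0.0 for pid in problem_ids}
        self.counts = {pid: 0 for pid in problem_ids}
        self.competence = 0.0

    def expected_pass_rate(self, pid):
        # Assumption: expected pass rate grows with (competence - difficulty).
        return sigmoid(self.competence - self.difficulty[pid])

    def select(self, batch_size):
        # Pick the problems whose difficulty is closest to current competence.
        ranked = sorted(
            self.difficulty,
            key=lambda pid: abs(self.competence - self.difficulty[pid]),
        )
        return ranked[:batch_size]

    def update(self, observed_pass_rates):
        # observed_pass_rates maps problem id -> fraction of correct rollouts.
        for pid, rate in observed_pass_rates.items():
            # Discrepancy between expected and observed performance:
            # positive when the problem was harder than expected.
            gap = self.expected_pass_rate(pid) - rate
            n = self.counts[pid] + 1
            # Running mean of all discrepancies seen so far for this problem.
            self.difficulty[pid] += (gap - self.difficulty[pid]) / n
            self.counts[pid] = n
        # Assumption: competence is the negative mean difficulty over all problems.
        self.competence = -sum(self.difficulty.values()) / len(self.difficulty)


sampler = AlignmentSampler(["easy", "medium", "hard"])
sampler.update({"easy": 1.0, "medium": 0.5, "hard": 0.0})
batch = sampler.select(1)
```

Because difficulty is averaged over the whole history rather than taken from a single step's pass rate, one noisy rollout batch shifts the estimate less and less as a problem accumulates observations, which is the stability property the abstract emphasizes.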
