Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Authors

  • Deyang Kong, National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Meituan Group, Beijing, China
  • Qi Guo, National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Meituan Group, Beijing, China
  • Xiangyu Xi, Meituan Group, Beijing, China
  • Wei Wang, Meituan Group, Beijing, China
  • Jingang Wang, Meituan Group, Beijing, China
  • Xunliang Cai, Meituan Group, Beijing, China
  • Shikun Zhang, National Engineering Research Center for Software Engineering, Peking University, Beijing, China
  • Wei Ye, National Engineering Research Center for Software Engineering, Peking University, Beijing, China

DOI

https://doi.org/10.1609/aaai.v40i37.40408

Abstract

Low sampling efficiency during the rollout phase poses a significant challenge to scaling reinforcement learning (RL) for large language model (LLM) reasoning. Existing methods attempt to improve efficiency by scheduling problems according to their difficulty. However, these approaches suffer from unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To address these limitations, we introduce Competence-Difficulty Alignment Sampling (CDAS). CDAS estimates problem difficulty accurately and stably by aggregating historical performance discrepancies across problems. Model competence is then quantified to adaptively select problems whose difficulty aligns with the model's current competence using a fixed-point system. Extensive experiments in mathematical RL training show that CDAS consistently outperforms strong baselines, achieving the highest average accuracy of 45.89%. Furthermore, CDAS reduces training step time overhead by 57.06% compared with the widely used Dynamic Sampling strategy, confirming its efficiency. Additional experiments on different tasks, model architectures, and model sizes demonstrate the generalization capability of CDAS.
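
The sampling loop described in the abstract can be sketched roughly as follows. This is only one possible reading, not the authors' implementation: the class name `AlignmentSampler`, the sigmoid-based expected pass rate, the running-mean difficulty update, and the definition of competence as the negative mean difficulty are all assumptions made for illustration, since the abstract does not specify these details.

```python
from __future__ import annotations

import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Toy competence-difficulty sampler; every update rule is an assumption."""

    def __init__(self, num_problems: int) -> None:
        self.difficulty = [0.0] * num_problems  # running mean of discrepancies
        self.visits = [0] * num_problems        # rollout rounds seen per problem
        self.competence = 0.0                   # scalar summary of the model

    def select(self, batch_size: int) -> list[int]:
        # Choose problems whose difficulty is closest to current competence.
        ranked = sorted(
            range(len(self.difficulty)),
            key=lambda i: abs(self.competence - self.difficulty[i]),
        )
        return ranked[:batch_size]

    def update(self, pass_rates: dict[int, float]) -> None:
        # pass_rates maps problem index -> observed pass rate in [0, 1].
        for i, observed in pass_rates.items():
            # Assumed model: expected pass rate grows with competence - difficulty.
            expected = sigmoid(self.competence - self.difficulty[i])
            discrepancy = expected - observed  # positive if harder than expected
            self.visits[i] += 1
            # Aggregate historical discrepancies as an incremental mean.
            self.difficulty[i] += (discrepancy - self.difficulty[i]) / self.visits[i]
        # Assumed competence definition: negative mean difficulty over all problems.
        self.competence = -sum(self.difficulty) / len(self.difficulty)


sampler = AlignmentSampler(num_problems=4)
sampler.update({0: 1.0, 1: 0.0})  # problem 0 always solved, problem 1 never solved
batch = sampler.select(batch_size=2)
```

In this sketch, difficulty and competence depend on each other, so repeated calls to `update` and `select` iterate the two quantities together. Averaging discrepancies over the whole history, rather than using only the latest pass rate, is what keeps the estimate from swinging with a single noisy rollout.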

Published

2026-03-14

How to Cite

Kong, D., Guo, Q., Xi, X., Wang, W., Wang, J., Cai, X., … Ye, W. (2026). Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31438–31446. https://doi.org/10.1609/aaai.v40i37.40408

Section

AAAI Technical Track on Natural Language Processing II