From Imitation to Discrimination: Toward a Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Authors

  • Changpeng Yang Xiaomi Corporation
  • Jinyang Wu Tsinghua University
  • Yuchen Liu Xiaomi Corporation
  • Shuai Zhang Tsinghua University
  • Yang Li Xiaomi Corporation
  • Qiliang Liang Peking University
  • Hongzhen Wang Xiaomi Corporation
  • Shuai Nie Xiaomi Corporation
  • Jiaming Xu Xiaomi Corporation
  • Runyu Shi Xiaomi Corporation
  • Ying Huang Xiaomi Corporation
  • Guoquan Zhang Xiaomi Corporation

DOI:

https://doi.org/10.1609/aaai.v40i40.40717

Abstract

Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, indicating whether it performed better or worse than expected, thereby yielding both positive and negative signals for training. However, existing approaches often mix positive and negative signals indiscriminately, especially in the early stages, leading to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization paradigm.
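
The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation) and a fixed, hypothetical `switch_step` that separates the positive-only phase from the full-signal phase. The abstract describes the curriculum as adaptive and does not specify the switching rule, so the hard step threshold and the function names here are illustrative assumptions only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def curriculum_advantages(advantages, step, switch_step):
    """Hypothetical curriculum gate (assumed, not from the paper).

    Before `switch_step`, negative advantages are zeroed so that only
    better-than-expected samples contribute (imitation-like phase).
    From `switch_step` onward, all advantages pass through unchanged
    (discrimination phase).
    """
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)


# Usage: four sampled responses to one prompt, two correct and two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
early = curriculum_advantages(adv, step=0, switch_step=100)
late = curriculum_advantages(adv, step=100, switch_step=100)
```

In the early phase the two incorrect responses receive zero weight, so the policy update only reinforces the correct ones. After the switch, the incorrect responses contribute their negative advantages as well. The gated advantages would then be plugged into whichever policy-gradient objective is in use (GRPO, PPO, RLOO, or Reinforce++).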

Published

2026-03-14

How to Cite

Yang, C., Wu, J., Liu, Y., Zhang, S., Li, Y., Liang, Q., … Zhang, G. (2026). From Imitation to Discrimination: Toward a Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34214–34222. https://doi.org/10.1609/aaai.v40i40.40717

Section

AAAI Technical Track on Natural Language Processing V