MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs

Authors

  • Wenxuan Liu, Du Xiaoman Financial, Beijing, China
  • Liangyu Huo, Du Xiaoman Financial, Beijing, China
  • Yi Jing, Du Xiaoman Financial, Beijing, China
  • Xiyuan Zhang, Du Xiaoman Financial, Beijing, China
  • Jian Xie, Du Xiaoman Financial, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i44.41101

Abstract

Reinforcement learning (RL) has recently become a powerful yet resource-intensive approach for post-training large language models (LLMs). Incorporating curriculum learning (CL) into RL has been shown to significantly improve training efficiency, particularly in reasoning tasks. However, existing CL methods face substantial challenges in multi-objective RL (MORL) settings, including: (1) difficulty in evaluating model capabilities online, (2) challenges in assessing sample importance under diverse objectives, and (3) inherent trade-offs between online training and offline inference in dynamically designing the curriculum. To address these issues, we propose a Multi-Reward space guided Adaptive Curriculum Learning framework (MRACL), which is the first to incorporate curriculum learning into multi-objective RL. MRACL first constructs a multi-dimensional reward space via offline inference to establish an initial reward profile for each training sample. During training, it estimates the evolving model capability as the centroid of this reward space and computes each sample's priority score from its distance to this capability estimate, its alignment with the optimization direction, and its historical evolution. This enables adaptive selection of the most informative training samples at each step, independent of the specific RL algorithm. After each RL training iteration, the reward space is dynamically updated to reflect the model's evolving capabilities and the shifting distribution of sample priorities. Experiments on multi-objective alignment tasks demonstrate that MRACL achieves 1.62× faster convergence than state-of-the-art curriculum methods and 2.55× faster than non-curriculum methods. Furthermore, it consistently outperforms all baselines in both win rate and rule-based evaluation. We further provide an in-depth analysis of the key factors contributing to MRACL's effectiveness, along with its advantages, applicable scenarios, and generalization across diverse settings.
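The priority-scoring idea sketched in the abstract can be illustrated as follows. This is a hypothetical reading, not the paper's actual formula: each sample has a multi-dimensional reward vector; the centroid of all vectors serves as a proxy for current model capability; and a sample's priority combines its distance to that centroid, its alignment with a chosen optimization direction, and how much its reward profile moved since the last iteration. The function names, weights `alpha`/`beta`/`gamma`, and the specific combination rule are all assumptions for illustration.

```python
import numpy as np

def priority_scores(rewards, prev_rewards, direction,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical sketch of MRACL-style sample prioritization.

    rewards:      (N, K) current reward vectors, one row per sample, K objectives
    prev_rewards: (N, K) reward vectors from the previous training iteration
    direction:    (K,)   desired optimization direction in reward space
    """
    # Centroid of the reward space as a proxy for current model capability.
    centroid = rewards.mean(axis=0)
    diff = rewards - centroid
    # Distance of each sample from the estimated capability.
    capability_dist = np.linalg.norm(diff, axis=1)
    # Alignment of each sample's offset with the optimization direction.
    d = direction / (np.linalg.norm(direction) + 1e-8)
    alignment = diff @ d
    # Historical evolution: how far the sample's reward profile moved last step.
    evolution = np.linalg.norm(rewards - prev_rewards, axis=1)
    return alpha * capability_dist + beta * alignment + gamma * evolution

def select_batch(rewards, prev_rewards, direction, k):
    """Pick the k highest-priority sample indices for the next RL step."""
    scores = priority_scores(rewards, prev_rewards, direction)
    return np.argsort(scores)[::-1][:k]
```

After each RL iteration, `rewards` would be refreshed from the updated policy, so the centroid, and hence the priorities, track the model's evolving capabilities.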

Published

2026-03-14

How to Cite

Liu, W., Huo, L., Jing, Y., Zhang, X., & Xie, J. (2026). MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37663–37672. https://doi.org/10.1609/aaai.v40i44.41101

Section

AAAI Special Track on AI Alignment