MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs

Authors

  • Wenxuan Liu, Du Xiaoman Financial, Beijing, China
  • Liangyu Huo, Du Xiaoman Financial, Beijing, China
  • Yi Jing, Du Xiaoman Financial, Beijing, China
  • Xiyuan Zhang, Du Xiaoman Financial, Beijing, China
  • Jian Xie, Du Xiaoman Financial, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i44.41101

Abstract

Reinforcement learning (RL) has recently become a powerful yet resource-intensive approach for post-training large language models (LLMs). Incorporating curriculum learning (CL) into RL has been shown to significantly improve training efficiency, particularly in reasoning tasks. However, existing CL methods face substantial challenges in multi-objective RL (MORL) settings, including: (1) difficulty in evaluating model capabilities online, (2) challenges in assessing sample importance under diverse objectives, and (3) inherent trade-offs between online training and offline inference in dynamically designing the curriculum. To address these issues, we propose a Multi-Reward space guided Adaptive Curriculum Learning framework (MRACL), which is the first to incorporate curriculum learning into multi-objective RL. MRACL first constructs a multi-dimensional reward space via offline inference to establish an initial reward profile for each training sample. During training, it estimates the evolving model capability as the centroid of this reward space and computes each sample's priority score from its distance to this capability estimate, its alignment with the optimization direction, and its historical evolution. This enables adaptive selection of the most informative training samples at each step, independent of the specific RL algorithm. After each RL training iteration, the reward space is dynamically updated to reflect the model's evolving capabilities and the shifting distribution of sample priorities. Experiments on multi-objective alignment tasks demonstrate that MRACL achieves 1.62× faster convergence than state-of-the-art curriculum methods and 2.55× faster than non-curriculum methods. Furthermore, it consistently outperforms all baselines in both win rate and rule-based evaluation. We further provide an in-depth analysis of the key factors contributing to MRACL's effectiveness, along with its advantages, applicable scenarios, and generalization across diverse settings.
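The priority-scoring idea sketched in the abstract can be illustrated as follows. This is a hypothetical reading, not the paper's actual formula: each sample has a multi-dimensional reward vector; the centroid of all vectors serves as a proxy for current model capability; and a sample's priority combines its distance to that centroid, its alignment with a chosen optimization direction, and how much its reward profile moved since the last iteration. The function names, weights `alpha`/`beta`/`gamma`, and the specific combination rule are all assumptions for illustration.

```python
import numpy as np

def priority_scores(rewards, prev_rewards, direction,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical sketch of MRACL-style sample prioritization.

    rewards:      (N, K) current reward vectors, one row per sample, K objectives
    prev_rewards: (N, K) reward vectors from the previous training iteration
    direction:    (K,)   desired optimization direction in reward space
    """
    # Centroid of the reward space as a proxy for current model capability.
    centroid = rewards.mean(axis=0)
    diff = rewards - centroid
    # Distance of each sample from the estimated capability.
    capability_dist = np.linalg.norm(diff, axis=1)
    # Alignment of each sample's offset with the optimization direction.
    d = direction / (np.linalg.norm(direction) + 1e-8)
    alignment = diff @ d
    # Historical evolution: how far the sample's reward profile moved last step.
    evolution = np.linalg.norm(rewards - prev_rewards, axis=1)
    return alpha * capability_dist + beta * alignment + gamma * evolution

def select_batch(rewards, prev_rewards, direction, k):
    """Pick the k highest-priority sample indices for the next RL step."""
    scores = priority_scores(rewards, prev_rewards, direction)
    return np.argsort(scores)[::-1][:k]
```

After each RL iteration, `rewards` would be refreshed from the updated policy, so the centroid, and hence the priorities, track the model's evolving capabilities.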

Published

2026-03-14

How to Cite

Liu, W., Huo, L., Jing, Y., Zhang, X., & Xie, J. (2026). MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37663–37672. https://doi.org/10.1609/aaai.v40i44.41101

Section

AAAI Special Track on AI Alignment