MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning
DOI:
https://doi.org/10.1609/aaai.v40i40.40694
Abstract
Outcome-based reinforcement learning has made notable advances in training language models (LMs) for reasoning. However, without explicit incentives and controls, this paradigm is limited and unstable in eliciting high-quality reasoning trajectories with diverse actions, particularly for models whose pretraining lacked extensive reasoning-related data. To this end, we introduce MetaAct-RL, a new RL framework that frames LMs' thinking as sequential decision making over meta-actions. In this framework, the model chooses and executes a high-level action at each step, such as forward reasoning, critique, or refinement, to gradually reach the correct answer. To encourage deeper exploration and richer action diversity, and to improve sampling efficiency during RL optimization, MetaAct-RL incorporates a length-based reward and regularization as well as a key-state restart mechanism. Extensive experiments across six benchmarks show that MetaAct-RL improves reasoning performance by 7.99 points on Llama3.2-1B and 7.17 points on Llama3.1-8B relative to the vanilla RL baseline. Moreover, on the challenging AIME-2024 benchmark, our method outperforms vanilla RL by 7.5 points with Qwen2.5-1.5B.
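The paper's implementation is not reproduced on this page. As a rough illustration of the loop the abstract describes, the Python sketch below frames thinking as sequential decision making over meta-actions. All names here (MetaAction, choose_action, execute, rollout) are hypothetical, and the uniform random action sampler merely stands in for the trained policy LM; this is not the authors' code.

```python
# Minimal, hypothetical sketch of a meta-action decision loop in the
# spirit of the abstract; not the MetaAct-RL implementation.
from enum import Enum
import random


class MetaAction(Enum):
    FORWARD = "forward_reasoning"  # extend the current chain of thought
    CRITIQUE = "critique"          # inspect the trajectory for errors
    REFINE = "refine"              # revise a step flagged by a critique


def choose_action(trajectory: list[str]) -> MetaAction:
    """Stand-in for the policy LM's action choice.

    In the framework the abstract describes, the model itself selects
    the next high-level action; here we simply sample uniformly.
    """
    return random.choice(list(MetaAction))


def execute(action: MetaAction, trajectory: list[str]) -> str:
    """Stand-in for the LM generating the content of one step."""
    return f"<{action.value}> step {len(trajectory) + 1}"


def rollout(max_steps: int = 8) -> list[str]:
    """Roll out one reasoning trajectory as sequential decision making
    over meta-actions, stopping at a fixed step budget."""
    trajectory: list[str] = []
    for _ in range(max_steps):
        action = choose_action(trajectory)
        trajectory.append(execute(action, trajectory))
    return trajectory


if __name__ == "__main__":
    for step in rollout():
        print(step)
```

In an actual RL setup, each rollout would be scored by an outcome (and, per the abstract, length-based) reward and used to update the policy; the toy above only shows the action-selection structure.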
Published
2026-03-14
How to Cite
Xi, Z., Wang, Y., Ding, Y., Li, G., Jin, S., Liu, S., Huang, J., Yang, D., Tang, J., Hong, B., Ye, J., Dou, S., Zhang, M., Guan, J., Wu, W., Zheng, R., Gui, T., Zhang, Q., & Huang, X. (2026). MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34006-34015. https://doi.org/10.1609/aaai.v40i40.40694
Issue
Vol. 40 No. 40 (2026)
Section
AAAI Technical Track on Natural Language Processing V