MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning

Authors

  • Zhiheng Xi, Fudan University
  • Yuhui Wang, Fudan University
  • Yiwen Ding, Fudan University
  • Guanyu Li, Fudan University
  • Senjie Jin, Fudan University
  • Shichun Liu, Fudan University
  • Jixuan Huang, Fudan University
  • Dingwen Yang, Fudan University
  • Jiafu Tang, Fudan University
  • Boyang Hong, Fudan University
  • Junjie Ye, Fudan University
  • Shihan Dou, Fudan University
  • Ming Zhang, Fudan University
  • Jian Guan, Ant Research
  • Wei Wu, Ant Research
  • Rui Zheng, Fudan University
  • Tao Gui, Fudan University, Shanghai Innovation Institute
  • Qi Zhang, Fudan University, wispaper.ai
  • Xuanjing Huang, Fudan University

DOI:

https://doi.org/10.1609/aaai.v40i40.40694

Abstract

Outcome-based reinforcement learning has made notable advances in training language models (LMs) for reasoning. However, without explicit incentives and controls, this paradigm is limited and unstable in eliciting high-quality reasoning trajectories with diverse actions, particularly for models whose pretraining lacked extensive reasoning-related data. To this end, we introduce MetaAct-RL, a new RL framework that frames LMs' thinking as sequential decision making over meta-actions. In this framework, the model chooses and executes a high-level action at each step, such as forward reasoning, critique, or refinement, to gradually reach the correct answer. To encourage deeper exploration and richer action diversity, and to improve sampling efficiency during RL optimization, MetaAct-RL incorporates an appropriate length-based reward and regularization, as well as a key-state restart mechanism. Extensive experiments across six benchmarks show that MetaAct-RL improves reasoning performance by 7.99 on Llama3.2-1B and by 7.17 on Llama3.1-8B relative to the vanilla RL method. Moreover, on the challenging AIME-2024 benchmark, our method outperforms vanilla RL by 7.5 with Qwen2.5-1.5B.
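To make the "sequential decision making over meta-actions" framing concrete, the sketch below shows one plausible way to represent a reasoning trajectory as a sequence of high-level action choices (forward reasoning, critique, refinement) and to measure its action diversity. This is an illustrative assumption, not the authors' implementation; the `rollout`, `toy_policy`, and `Trajectory` names are hypothetical.

```python
# Illustrative sketch (NOT the paper's code): a reasoning trajectory as a
# sequence of meta-action choices, in the spirit of MetaAct-RL's framing.
from dataclasses import dataclass, field

# The three meta-actions named in the abstract.
META_ACTIONS = ("forward_reasoning", "critique", "refinement")

@dataclass
class TrajectoryStep:
    action: str  # which meta-action the policy chose at this step
    text: str    # reasoning text generated under that action

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def action_diversity(self) -> int:
        """Number of distinct meta-actions used; a diversity regularizer
        could plausibly reward larger values of this quantity."""
        return len({s.action for s in self.steps})

def rollout(policy, problem, max_steps=8):
    """Sample one trajectory: at each step the policy picks a meta-action
    and emits text conditioned on the problem and the history so far."""
    traj = Trajectory()
    for _ in range(max_steps):
        action, text, done = policy(problem, traj.steps)
        traj.steps.append(TrajectoryStep(action, text))
        if done:
            break
    return traj
```

A toy policy that cycles through the three meta-actions and stops after a refinement step would produce a three-step trajectory using all three actions, which is the kind of diverse trajectory the framework's regularization is described as encouraging.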

Published

2026-03-14

How to Cite

Xi, Z., Wang, Y., Ding, Y., Li, G., Jin, S., Liu, S., Huang, J., Yang, D., Tang, J., Hong, B., Ye, J., Dou, S., Zhang, M., Guan, J., Wu, W., Zheng, R., Gui, T., Zhang, Q., & Huang, X. (2026). MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34006-34015. https://doi.org/10.1609/aaai.v40i40.40694

Section

AAAI Technical Track on Natural Language Processing V