When Instinct Guides and Insight Grounds: Staged RL Training for LLM Agents

Zijing Zhang; Boning Zhang

doi:10.1609/aaai.v40i41.40794

Authors

Zijing Zhang, Peking University
Boning Zhang, Institute of Automation, Chinese Academy of Sciences

Abstract

Large Language Model (LLM) agents have demonstrated strong potential in complex, interactive decision-making tasks. However, when training LLM agents end-to-end with reinforcement learning (RL), efficiently optimizing agent policies in dynamic environments remains a significant challenge. Existing RL-based LLM agent paradigms commonly organize interactions in a cycle where reasoning is followed by action. In our work, we observe a phenomenon we call Exploration Contraction, where the explicit introduction of a reasoning stage reduces the diversity of actions—quantified by lower action entropy—which in turn limits exploration and leads to premature policy convergence. To address this limitation, we propose Act-before-Reasoning (ActRe), a two-stage RL training framework. In the first stage, we reverse the typical rollout order, prompting the agent to generate actions prior to reasoning, which encourages exploration driven by model intuition. In the second stage, we restore the standard reasoning-then-action order for training and evaluation, ensuring robust and interpretable decision-making. Experiments on the ALFWorld and WebShop benchmarks show that ActRe effectively mitigates exploration contraction, yielding consistently higher task success rates and improved training robustness compared to strong RL baselines. Our analysis underscores the importance of action entropy in the exploration-exploitation trade-off during LLM agent training and provides a practical approach to maintain the benefits of explicit reasoning while promoting sufficient exploration.
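The abstract quantifies action diversity by action entropy but does not give the exact definition. As a minimal sketch, assuming action entropy means the empirical Shannon entropy of actions sampled for the same observation, it could be estimated as below. The function name `action_entropy` and the sample action lists are hypothetical illustrations, not data or code from the paper.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Empirical Shannon entropy (in nats) of a list of sampled actions.

    Assumption: actions are hashable (e.g. strings) and were all sampled
    for the same observation, so their frequencies approximate the policy.
    """
    total = len(actions)
    if total == 0:
        raise ValueError("need at least one sampled action")
    counts = Counter(actions)
    # H = -sum_a p(a) * log p(a), with p(a) estimated by relative frequency.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples, invented for illustration only.
concentrated = ["open drawer", "open drawer", "open drawer", "go to shelf"]
spread_out = ["open drawer", "go to shelf", "take mug", "look"]

print(action_entropy(concentrated))
print(action_entropy(spread_out))
```

Under this assumption, a policy whose samples concentrate on a few actions scores lower than one whose samples are spread out, which is the sense in which lower action entropy indicates reduced exploration.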
