Reasoning with Exploration: An Entropy Perspective

Authors

  • Daixuan Cheng, Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Institute for General Artificial Intelligence; Beijing Key Laboratory of Research on Large Models and Intelligent Governance
  • Shaohan Huang, Microsoft Research
  • Xuekai Zhu, Shanghai Jiaotong University
  • Bo Dai, Beijing Institute for General Artificial Intelligence
  • Xin Zhao, Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance
  • Zhenliang Zhang, Beijing Institute for General Artificial Intelligence
  • Furu Wei, Microsoft Research

DOI:

https://doi.org/10.1609/aaai.v40i36.40290

Abstract

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation and increasingly encounter performance plateaus. In this work, we revisit entropy, a signal of exploration in RL, and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods, which encourage exploration by promoting uncertainty, we encourage exploration by promoting deeper and longer reasoning chains. Notably, our method achieves significant gains on the Pass@K metric, an upper-bound estimator of LM reasoning capabilities, even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
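
To make the idea of "augmenting the advantage function with an entropy-based term" concrete, here is a minimal sketch in plain Python. The abstract does not state the exact form of the term, so everything specific below is an assumption for illustration: the helper names `token_entropy` and `shape_advantages`, the scaling coefficient `alpha`, and the clipping factor `kappa` (assumed greater than 1 so the bonus never flips the sign of the advantage). In an autodiff framework the entropy would presumably be treated as a constant (detached from the gradient), so that it only rescales the policy-gradient signal instead of being maximized directly.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantages(advantages, entropies, alpha=0.4, kappa=2.0):
    """Add a clipped entropy bonus to each per-token advantage.

    Assumed form (not taken from the abstract):
        bonus_t = min(alpha * H_t, |A_t| / kappa)
    With kappa > 1 the bonus stays below |A_t|, so the sign of the
    advantage is preserved. The entropies are plain floats here; in a
    training loop they would be treated as constants (no gradient).
    """
    shaped = []
    for a, h in zip(advantages, entropies):
        bonus = min(alpha * h, abs(a) / kappa)  # the "one line" of shaping
        shaped.append(a + bonus)
    return shaped


# Hypothetical usage: a high-entropy (uniform) position and a near-deterministic one.
h_uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # equals ln(4)
h_peaked = token_entropy([50.0, 0.0, 0.0, 0.0])   # close to 0
shaped = shape_advantages([1.0, 1.0], [h_uniform, h_peaked])
```

Under these assumptions, the high-entropy token receives a larger effective advantage than the near-deterministic one, while tokens with zero advantage are left unchanged because the clip bound is zero.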

Published

2026-03-14

How to Cite

Cheng, D., Huang, S., Zhu, X., Dai, B., Zhao, X., Zhang, Z., & Wei, F. (2026). Reasoning with Exploration: An Entropy Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30377-30385. https://doi.org/10.1609/aaai.v40i36.40290

Section

AAAI Technical Track on Natural Language Processing I