Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models

Authors

  • Jinman Zhao Department of Computer Science, University of Toronto
  • Erxue Min Baidu Inc
  • Hui Wu Aerospace Information Research Institute, Chinese Academy of Sciences
  • Ziheng Li Peking University
  • Zexu Sun Baidu Inc
  • Hengyi Cai Baidu Inc
  • Shuaiqiang Wang Baidu Inc
  • Xu Chen Renmin University of China
  • Gerald Penn Department of Computer Science, University of Toronto

DOI:

https://doi.org/10.1609/aaai.v40i41.40798

Abstract

Large language models (LLMs) have shown impressive capabilities in natural language tasks, yet they continue to struggle with multi-step mathematical reasoning, where correctness depends on a precise chain of intermediate steps. Preference optimization methods such as Direct Preference Optimization (DPO) have improved answer-level alignment, but they often overlook the reasoning process itself, providing little supervision over intermediate steps that are critical for complex problem-solving. Existing fine-grained approaches typically rely on strong annotators or reward models to assess the quality of individual steps. However, reward models are vulnerable to reward hacking. To address this, we propose ISLA, a reward-model-free framework that constructs step-level preference data directly from SFT gold traces. ISLA also introduces a self-improving pruning mechanism that identifies informative steps based on two signals: their marginal contribution to final accuracy (relative accuracy) and the model’s uncertainty, inspired by the concept of information gain. Empirically, ISLA achieves better performance than DPO while using only 12% of the training tokens, demonstrating that careful step-level selection can significantly improve both reasoning accuracy and training efficiency.
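To make the abstract's selection criterion concrete, below is a minimal sketch of how an information-gain-style step score might combine the two signals it names: a step's marginal contribution to final accuracy (relative accuracy) and the model's uncertainty on that step. The function names (`score_step`, `prune_steps`), the mean-token-entropy proxy for uncertainty, and the additive combination with weight `alpha` are illustrative assumptions, not ISLA's actual formulation.

```python
import math

def step_entropy(token_probs):
    """Mean per-token entropy (nats) over a step's token distributions.

    Used here as a simple proxy for the model's uncertainty on the step;
    the paper's exact uncertainty measure may differ.
    """
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_probs]
    return sum(entropies) / len(entropies)

def score_step(acc_with, acc_without, token_probs, alpha=1.0):
    """Informativeness of a step: marginal accuracy gain plus weighted uncertainty.

    acc_with / acc_without: final-answer accuracy with and without the step
    (their difference is the 'relative accuracy' the abstract refers to).
    alpha: hypothetical weight trading off the two signals.
    """
    relative_accuracy = acc_with - acc_without
    return relative_accuracy + alpha * step_entropy(token_probs)

def prune_steps(steps, top_k):
    """Keep only the top_k most informative steps of a reasoning trace."""
    ranked = sorted(
        steps,
        key=lambda s: score_step(s["acc_with"], s["acc_without"], s["probs"]),
        reverse=True,
    )
    return ranked[:top_k]
```

Under this sketch, a step that both changes the final answer's accuracy and is generated under high model uncertainty ranks highest, matching the intuition that such steps carry the most information for step-level preference training.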

Published

2026-03-14

How to Cite

Zhao, J., Min, E., Wu, H., Li, Z., Sun, Z., Cai, H., … Penn, G. (2026). Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34941–34949. https://doi.org/10.1609/aaai.v40i41.40798

Section

AAAI Technical Track on Natural Language Processing VI