Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i41.40798

Abstract
Large language models (LLMs) have shown impressive capabilities on natural language tasks, yet they continue to struggle with multi-step mathematical reasoning, where correctness depends on a precise chain of intermediate steps. Preference optimization methods such as Direct Preference Optimization (DPO) have improved answer-level alignment, but they largely overlook the reasoning process itself, providing little supervision over the intermediate steps that are critical for complex problem-solving. Existing fine-grained approaches typically rely on strong annotators or reward models to assess the quality of individual steps; reward models, however, are vulnerable to reward hacking. To address this, we propose ISLA, a reward-model-free framework that constructs step-level preference data directly from SFT gold traces. ISLA also introduces a self-improving pruning mechanism that identifies informative steps based on two signals: a step's marginal contribution to final accuracy (relative accuracy) and the model's uncertainty, inspired by the concept of information gain. Empirically, ISLA outperforms DPO while using only 12% of the training tokens, demonstrating that careful step-level selection can significantly improve both reasoning accuracy and training efficiency.
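The abstract describes scoring each reasoning step by two signals (relative accuracy and model uncertainty) and pruning to the most informative ones. The paper's exact scoring function is not given here, so the sketch below is purely illustrative: it combines a hypothetical marginal-accuracy term with mean token negative log-likelihood as an uncertainty proxy, and keeps the top-scoring fraction of steps. All function names, weights, and the additive combination are assumptions, not the authors' method.

```python
import math

def step_scores(acc_with, acc_without, step_token_probs):
    """Hypothetical per-step scores combining the abstract's two signals.

    acc_with[i]      -- final-answer accuracy when step i is included
    acc_without[i]   -- final-answer accuracy when step i is removed
    step_token_probs -- per-step lists of the model's token probabilities

    The additive combination is an illustrative placeholder, not ISLA's
    actual formulation.
    """
    scores = []
    for a_w, a_wo, probs in zip(acc_with, acc_without, step_token_probs):
        rel_acc = a_w - a_wo  # marginal contribution to final accuracy
        # uncertainty proxy: mean negative log-likelihood of the step's tokens
        uncertainty = -sum(math.log(p) for p in probs) / len(probs)
        scores.append(rel_acc + uncertainty)
    return scores

def prune_steps(steps, scores, keep_ratio=0.12):
    """Keep the highest-scoring fraction of steps; the 0.12 default echoes
    the reported 12% token budget but is only a placeholder here."""
    k = max(1, int(len(steps) * keep_ratio))
    ranked = sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)
    return [steps[i] for i in sorted(ranked[:k])]
```

A usage sketch: given four candidate steps with precomputed accuracies and token probabilities, `prune_steps(steps, step_scores(...), keep_ratio=0.5)` would retain the two steps whose combined accuracy gain and uncertainty are highest, in their original order.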
Published
2026-03-14
How to Cite
Zhao, J., Min, E., Wu, H., Li, Z., Sun, Z., Cai, H., … Penn, G. (2026). Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34941–34949. https://doi.org/10.1609/aaai.v40i41.40798
Issue
Section
AAAI Technical Track on Natural Language Processing VI