Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Authors

  • Changyuan Tian, Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology (TCAT); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences
  • Zhicong Lu, Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology (TCAT); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences
  • Shuang Qian, Meituan
  • Nayu Liu, School of Computer Science and Technology, Tiangong University
  • Peiguang Li, Meituan
  • Li Jin, Aerospace Information Research Institute, Chinese Academy of Sciences
  • Leiyi Hu, Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology (TCAT); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences
  • Zhizhao Zeng, Meituan
  • Sirui Wang, Meituan
  • Ke Zeng, Meituan
  • Guozhi Cas, Aerospace Information Research Institute, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i39.40609

Abstract

To improve the Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process and rendering a final verdict on each problem-solution pair. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability, and pay little attention to the underlying reason for LLMs' poor critiquing performance. In this paper, we orthogonally quantify and investigate a potential reason, imbalanced evaluation preference, through a statistical preference analysis; motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm that rectifies the preference and elevates critiquing capability. Specifically, to probe LLMs' critiquing characteristics, we meticulously construct a One-to-many Problem-Solution (OPS) benchmark that quantifies how differently an LLM behaves when evaluating solutions it generated itself versus solutions generated by other models. To investigate this behavior difference in depth, we then conduct a perplexity-oriented statistical preference analysis and uncover an intriguing phenomenon: LLMs tend to judge solutions with lower perplexity as correct, which we dub imbalanced evaluation preference. To rectify this preference, we use perplexity as the baton in Group Relative Policy Optimization (GRPO), encouraging LLMs to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experiments on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
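The abstract only sketches the method, so the exact reward design is not reproduced here. The Python sketch below illustrates the two ingredients the abstract names, under stated assumptions: perplexity() is the standard exponentiated mean negative token log-likelihood, while ppl_aware_group_advantages(), ppl_threshold, and exploration_bonus are hypothetical names for a GRPO-style group-relative normalization with a bonus for verdicts that go against the perplexity prior (judging lower-perplexity solutions as wrong and higher-perplexity solutions as correct). It is a minimal illustration, not the authors' implementation.

import math
from dataclasses import dataclass

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a candidate solution under a language model:
    exp of the mean negative token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

@dataclass
class CritiqueTrajectory:
    solution_ppl: float    # perplexity of the judged solution
    verdict_correct: bool  # the critic's final verdict on the solution
    base_reward: float     # e.g. 1.0 if the verdict matches the gold label

def ppl_aware_group_advantages(
    group: list[CritiqueTrajectory],
    ppl_threshold: float,
    exploration_bonus: float = 0.5,
) -> list[float]:
    """GRPO-style group-relative advantages with a perplexity-aware bonus.

    Hypothetical shaping rule: trajectories whose verdict goes against the
    perplexity prior (judging a low-perplexity solution as wrong, or a
    high-perplexity solution as correct) receive a small bonus on top of
    the correctness reward, nudging the policy away from the imbalanced
    evaluation preference described in the abstract.
    """
    shaped = []
    for t in group:
        reward = t.base_reward
        against_prior = (
            (t.solution_ppl < ppl_threshold and not t.verdict_correct)
            or (t.solution_ppl >= ppl_threshold and t.verdict_correct)
        )
        if against_prior:
            reward += exploration_bonus
        shaped.append(reward)
    # Standard GRPO normalization: advantage = (r - group mean) / group std.
    mean = sum(shaped) / len(shaped)
    var = sum((r - mean) ** 2 for r in shaped) / len(shaped)
    std = math.sqrt(var) if var > 0 else 1.0
    return [(r - mean) / std for r in shaped]

if __name__ == "__main__":
    # Two critiques of the same low-perplexity solution: the against-prior
    # verdict ("wrong") earns a bonus and hence a larger advantage.
    group = [
        CritiqueTrajectory(solution_ppl=1.8, verdict_correct=False, base_reward=1.0),
        CritiqueTrajectory(solution_ppl=1.8, verdict_correct=True, base_reward=0.0),
    ]
    print(ppl_aware_group_advantages(group, ppl_threshold=2.5))

Tying the bonus to the base correctness reward, rather than replacing it, reflects the abstract's framing: perplexity steers exploration within GRPO while the final verdict is still graded against the gold label.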

Published

2026-03-14

How to Cite

Tian, C., Lu, Z., Qian, S., Liu, N., Li, P., Jin, L., … Cas, G. (2026). Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33241–33249. https://doi.org/10.1609/aaai.v40i39.40609

Section

AAAI Technical Track on Natural Language Processing IV