Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Authors

  • Junjie Chen (Department of Computer Science and Technology, Tsinghua University; Quan Cheng Laboratory)
  • Weihang Su (Department of Computer Science and Technology, Tsinghua University)
  • Zhumin Chu (Department of Computer Science and Technology, Tsinghua University)
  • Haitao Li (Department of Computer Science and Technology, Tsinghua University)
  • Yujia Zhou (Department of Computer Science and Technology, Tsinghua University)
  • Dingbo Yuan (Ant Group)
  • Xudong Wang (Ant Group)
  • Jun Zhou (Ant Group)
  • Yiqun Liu (Department of Computer Science and Technology, Tsinghua University)
  • Min Zhang (Department of Computer Science and Technology, Tsinghua University)
  • Shaoping Ma (Department of Computer Science and Technology, Tsinghua University)
  • Qingyao Ai (Quan Cheng Laboratory; Department of Computer Science and Technology, Tsinghua University)

DOI

https://doi.org/10.1609/aaai.v40i36.40274

Abstract

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.

Published

2026-03-14

How to Cite

Chen, J., Su, W., Chu, Z., Li, H., Zhou, Y., Yuan, D., … Ai, Q. (2026). Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30235–30242. https://doi.org/10.1609/aaai.v40i36.40274

Section

AAAI Technical Track on Natural Language Processing I