MetaEval: Measuring the Discrimination of Benchmarks for Efficient LLM Evaluation

Authors

  • Zhuo Wang East China Normal University
  • Wen Wu East China Normal University
  • Guoqing Wang East China Normal University
  • Guangze Ye East China Normal University
  • Zhenxiao Cheng East China Normal University

DOI:

https://doi.org/10.1609/aaai.v40i40.40668

Abstract

Benchmarks serve as standardized test systems to distinguish capabilities among large language models (LLMs). Discriminative items lead high-ability LLMs to favor correct answers, while low-ability models assign lower plausibility to those answers and tend toward incorrect ones. Current methods for assessing benchmark quality focus primarily on coverage of difficulty levels and task diversity, yet lack a direct quantification of discrimination, the core metric. Furthermore, large-scale benchmarks incur high evaluation costs. Although heuristic methods can reduce item counts to some extent, they cannot guarantee preservation of a benchmark’s original discriminative properties. To address these limitations, we propose MetaEval, a meta-evaluation framework designed to precisely quantify per-item discrimination and enable efficient assessment. Central to MetaEval is our novel Signal Detection and Item Response (SD-IR) model, which simulates LLMs’ detection of correct answers (signals) by representing each model’s perception through two latent ability states: “known” and “unknown”. For any item, discrimination is quantified as the difference in signal plausibility between these two states. Leveraging these discrimination metrics, MetaEval introduces two strategies for replicating full-benchmark results from minimal subsets: (1) distilling metaBench, a compact subset that retains discriminative power by removing redundant items; and (2) predicting full-benchmark performance from metaBench’s discrimination. Experiments across five benchmarks confirm that high-discrimination items capture greater performance variation among LLMs, align more closely with full-benchmark rankings, and exhibit superior predictive ability. Notably, in the best case, MetaEval accurately estimates full-benchmark performance using only 2.5% of the items, substantially reducing evaluation costs while preserving reliability.
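The central quantity described above is straightforward: for each item, discrimination is the gap between the plausibility a model in the “known” state assigns to the correct answer (the signal) and the plausibility assigned in the “unknown” state, and the distilled metaBench keeps the items where this gap is largest. The sketch below illustrates that reading in Python; it is a toy approximation of the idea from the abstract, not the paper’s actual SD-IR implementation, and the names item_discrimination, select_metabench, p_known, and p_unknown are illustrative assumptions.

```python
import numpy as np

def item_discrimination(p_known: np.ndarray, p_unknown: np.ndarray) -> np.ndarray:
    # Per-item discrimination: how much more plausible the correct answer
    # (the signal) is under the "known" state than under the "unknown" state.
    return p_known - p_unknown

def select_metabench(discrimination: np.ndarray, k: int) -> np.ndarray:
    # Keep the indices of the k most discriminative items; the remaining
    # low-discrimination items are treated as redundant and dropped.
    return np.argsort(discrimination)[::-1][:k]

# Hypothetical plausibility estimates for 8 items under the two latent states.
p_known = np.array([0.95, 0.90, 0.60, 0.85, 0.55, 0.98, 0.70, 0.88])
p_unknown = np.array([0.30, 0.80, 0.55, 0.20, 0.50, 0.35, 0.65, 0.25])

disc = item_discrimination(p_known, p_unknown)  # per-item discrimination scores
subset = select_metabench(disc, k=3)            # indices of a compact "metaBench"
print(disc, subset)
```

In this toy setting, the retained items are those on which a high-ability and a low-ability model disagree most, which is the property the distilled subset is meant to preserve.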

Published

2026-03-14

How to Cite

Wang, Z., Wu, W., Wang, G., Ye, G., & Cheng, Z. (2026). MetaEval: Measuring the Discrimination of Benchmarks for Efficient LLM Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33773–33781. https://doi.org/10.1609/aaai.v40i40.40668

Section

AAAI Technical Track on Natural Language Processing V