MetaEval: Measuring the Discrimination of Benchmarks for Efficient LLM Evaluation

Authors

  • Zhuo Wang East China Normal University
  • Wen Wu East China Normal University
  • Guoqing Wang East China Normal University
  • Guangze Ye East China Normal University
  • Zhenxiao Cheng East China Normal University

DOI:

https://doi.org/10.1609/aaai.v40i40.40668

Abstract

Benchmarks serve as standardized test systems to distinguish capabilities among large language models (LLMs). Discriminative items lead high-ability LLMs to favor correct answers, while low-ability models assign lower plausibility to those answers and tend toward incorrect ones. Current methods for assessing benchmark quality focus primarily on coverage of difficulty levels and task diversity, yet lack a direct quantification of discrimination, the core metric. Furthermore, large-scale benchmarks incur high evaluation costs. Although heuristic methods can reduce item counts to some extent, they cannot guarantee preservation of a benchmark’s original discriminative properties. To address these limitations, we propose MetaEval, a meta-evaluation framework designed to precisely quantify per-item discrimination and enable efficient assessment. Central to MetaEval is our novel Signal Detection and Item Response (SD-IR) model, which simulates LLMs’ detection of correct answers (signals) by representing each model’s perception through two latent ability states: “known” and “unknown”. For any item, discrimination is quantified as the difference in signal plausibility between these two states. Leveraging these discrimination metrics, MetaEval introduces two strategies for replicating full-benchmark results from minimal subsets: (1) distilling metaBench, a compact subset that retains discriminative power by removing redundant items; and (2) predicting full-benchmark performance from metaBench’s discrimination. Experiments across five benchmarks confirm that high-discrimination items capture greater performance variation among LLMs, align more closely with full-benchmark rankings, and exhibit superior predictive ability. Notably, in the best case, MetaEval accurately estimates full-benchmark performance using only 2.5% of the items, substantially reducing evaluation costs while preserving reliability.
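The central quantity described above is straightforward: for each item, discrimination is the gap between the plausibility a model in the “known” state assigns to the correct answer (the signal) and the plausibility assigned in the “unknown” state, and the distilled metaBench keeps the items where this gap is largest. The sketch below illustrates that reading in Python; it is a toy approximation of the idea from the abstract, not the paper’s actual SD-IR implementation, and the names item_discrimination, select_metabench, p_known, and p_unknown are illustrative assumptions.

```python
import numpy as np

def item_discrimination(p_known: np.ndarray, p_unknown: np.ndarray) -> np.ndarray:
    # Per-item discrimination: how much more plausible the correct answer
    # (the signal) is under the "known" state than under the "unknown" state.
    return p_known - p_unknown

def select_metabench(discrimination: np.ndarray, k: int) -> np.ndarray:
    # Keep the indices of the k most discriminative items; the remaining
    # low-discrimination items are treated as redundant and dropped.
    return np.argsort(discrimination)[::-1][:k]

# Hypothetical plausibility estimates for 8 items under the two latent states.
p_known = np.array([0.95, 0.90, 0.60, 0.85, 0.55, 0.98, 0.70, 0.88])
p_unknown = np.array([0.30, 0.80, 0.55, 0.20, 0.50, 0.35, 0.65, 0.25])

disc = item_discrimination(p_known, p_unknown)  # per-item discrimination scores
subset = select_metabench(disc, k=3)            # indices of a compact "metaBench"
print(disc, subset)
```

In this toy setting, the retained items are those on which a high-ability and a low-ability model disagree most, which is the property the distilled subset is meant to preserve.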

Published

2026-03-14

How to Cite

Wang, Z., Wu, W., Wang, G., Ye, G., & Cheng, Z. (2026). MetaEval: Measuring the Discrimination of Benchmarks for Efficient LLM Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33773–33781. https://doi.org/10.1609/aaai.v40i40.40668

Section

AAAI Technical Track on Natural Language Processing V