On the Evaluation of Capability Estimation Methods for Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i37.40368
Abstract
The emergence of large language models (LLMs) marks a transformative era in artificial intelligence (AI). However, systematically evaluating the capability of LLMs is challenging because it requires a large amount of labeled test data. To tackle this problem, AutoEval has been proposed in the conventional AI field to estimate the capability of AI models without data labeling effort. Unfortunately, although multiple AutoEval methods have been proposed, most are constructed for classification tasks and evaluated only on image datasets. As a result, their effectiveness for LLMs, which often target generation tasks, is unclear. In this work, we introduce AEBench, the first AutoEval benchmark specifically designed to estimate the capability of LLMs using unlabeled test data. Besides existing AutoEval methods, AEBench also supports our own method, which utilizes the correlation between data uncertainty and model ability for capability estimation. In total, AEBench covers 12 AutoEval methods and 120 method combinations. Based on AEBench, we conducted a comprehensive study to explore the usefulness of AutoEval on LLMs. Experimental results on 10 datasets demonstrate that our uncertainty-feature-based methods perform best, achieving the lowest estimation errors.
Published
2026-03-14
How to Cite
Hu, Q., Wen, J., Zhang, Y., Cordy, M., & Lyu, Y. (2026). On the Evaluation of Capability Estimation Methods for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31077–31085. https://doi.org/10.1609/aaai.v40i37.40368
Section
AAAI Technical Track on Natural Language Processing II