On the Evaluation of Capability Estimation Methods for Large Language Models

Qiang Hu; Jin Wen; Yao Zhang; Maxime Cordy; Yongqiang Lyu

doi:10.1609/aaai.v40i37.40368

Authors

Qiang Hu Tianjin University
Jin Wen University of Luxembourg
Yao Zhang Tianjin University
Maxime Cordy University of Luxembourg
Yongqiang Lyu Tianjin University

DOI:

https://doi.org/10.1609/aaai.v40i37.40368

Abstract

The emergence of large language models (LLMs) marks a transformative era in artificial intelligence (AI). However, systematically evaluating the capability of LLMs is challenging because it requires a large amount of labeled test data. To tackle this problem, AutoEval has been proposed in the conventional AI field to estimate the capability of AI models without data labeling effort. Unfortunately, although multiple AutoEval methods exist, most are designed for classification tasks and evaluated only on image datasets. As a result, their effectiveness for LLMs, which often target generation tasks, is unclear. In this work, we introduce AEBench, the first AutoEval benchmark specifically designed to estimate the capability of LLMs using unlabeled test data. Besides existing AutoEval methods, AEBench also supports our own method, which exploits the correlation between data uncertainty and model ability for capability estimation. In total, AEBench covers 12 AutoEval methods and 120 method combinations. Based on AEBench, we conducted a comprehensive study of the usefulness of AutoEval for LLMs. Experimental results on 10 datasets show that our uncertainty-feature-based methods perform best, achieving the lowest estimation errors.
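
To make the idea of label-free capability estimation concrete, the following is a minimal sketch of one generic regression-style AutoEval setup, not the method implemented in AEBench. It assumes that a few "meta" sets with known accuracy are available, that a single uncertainty feature (mean predictive entropy) is used, and that the relation between that feature and accuracy is roughly linear. All function names and the toy numbers are hypothetical and serve only as illustration.

```python
import math
from statistics import mean


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(batch):
    """Average predictive entropy over a set of per-sample distributions."""
    return mean(entropy(p) for p in batch)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def estimate_accuracy(a, b, batch):
    """Predict the accuracy of an unlabeled set, clipped to [0, 1]."""
    return min(1.0, max(0.0, a * mean_entropy(batch) + b))


# Synthetic illustration only: three labeled meta sets, each given as
# (per-sample output distributions, known accuracy). Values are made up.
meta_sets = [
    ([[0.9, 0.1], [0.8, 0.2]], 0.90),
    ([[0.7, 0.3], [0.6, 0.4]], 0.70),
    ([[0.5, 0.5], [0.55, 0.45]], 0.55),
]
a, b = fit_line([mean_entropy(d) for d, _ in meta_sets],
                [acc for _, acc in meta_sets])

# An unlabeled test set: only model outputs are needed, no labels.
unlabeled = [[0.75, 0.25], [0.65, 0.35]]
print(f"estimated accuracy: {estimate_accuracy(a, b, unlabeled):.3f}")
```

In this sketch the regressor is fitted once on labeled meta sets and then applied to model outputs alone, which is what removes the need to label the test data. A real study would likely use richer uncertainty features and a more flexible regressor.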
