Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Authors

  • Hongli Zhou Harbin Institute of Technology
  • Hui Huang Harbin Institute of Technology
  • Ziqing Zhao Harbin Institute of Technology
  • Lvyuan Han Harbin Institute of Technology
  • Huicheng Wang Harbin Institute of Technology
  • Kehai Chen Harbin Institute of Technology (Shenzhen)
  • Muyun Yang Harbin Institute of Technology
  • Wei Bao China Electronics Standardization Institute
  • Jian Dong China Electronics Standardization Institute
  • Bing Xu Harbin Institute of Technology
  • Conghui Zhu Harbin Institute of Technology
  • Hailong Cao Harbin Institute of Technology
  • Tiejun Zhao Harbin Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i41.40814

Abstract

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies across leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT yields accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks that maintain stronger alignment with human preference.
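For readers unfamiliar with Item Response Theory, the following is a minimal sketch of the standard two-parameter logistic (2PL) IRT model that IRT-grounded frameworks of this kind build on. The function name and example values are illustrative assumptions, not details from the paper; PSN-IRT itself uses a richer parameterization.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability that a model with ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly (2PL IRT)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the success probability is 0.5;
# higher discrimination `a` makes the curve steeper around that point.
print(p_correct_2pl(theta=0.0, a=1.5, b=0.0))
print(p_correct_2pl(theta=2.0, a=1.5, b=0.0))
```

Fitting such a model to a matrix of model-by-item correctness scores recovers per-item difficulty and discrimination alongside per-model ability, which is what enables diagnosing benchmark quality item by item.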

Published

2026-03-14

How to Cite

Zhou, H., Huang, H., Zhao, Z., Han, L., Wang, H., Chen, K., … Zhao, T. (2026). Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35085–35093. https://doi.org/10.1609/aaai.v40i41.40814

Section

AAAI Technical Track on Natural Language Processing VI