Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Authors

  • Hongli Zhou Harbin Institute of Technology
  • Hui Huang Harbin Institute of Technology
  • Ziqing Zhao Harbin Institute of Technology
  • Lvyuan Han Harbin Institute of Technology
  • Huicheng Wang Harbin Institute of Technology
  • Kehai Chen Harbin Institute of Technology (Shenzhen)
  • Muyun Yang Harbin Institute of Technology
  • Wei Bao China Electronics Standardization Institute
  • Jian Dong China Electronics Standardization Institute
  • Bing Xu Harbin Institute of Technology
  • Conghui Zhu Harbin Institute of Technology
  • Hailong Cao Harbin Institute of Technology
  • Tiejun Zhao Harbin Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i41.40814

Abstract

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies across leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT yields accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks that maintain stronger alignment with human preference.
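For readers unfamiliar with Item Response Theory, the following is a minimal sketch of the standard two-parameter logistic (2PL) IRT model that IRT-grounded frameworks of this kind build on. The function name and example values are illustrative assumptions, not details from the paper; PSN-IRT itself uses a richer parameterization.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability that a model with ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly (2PL IRT)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the success probability is 0.5;
# higher discrimination `a` makes the curve steeper around that point.
print(p_correct_2pl(theta=0.0, a=1.5, b=0.0))
print(p_correct_2pl(theta=2.0, a=1.5, b=0.0))
```

Fitting such a model to a matrix of model-by-item correctness scores recovers per-item difficulty and discrimination alongside per-model ability, which is what enables diagnosing benchmark quality item by item.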

Published

2026-03-14

How to Cite

Zhou, H., Huang, H., Zhao, Z., Han, L., Wang, H., Chen, K., … Zhao, T. (2026). Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35085–35093. https://doi.org/10.1609/aaai.v40i41.40814

Section

AAAI Technical Track on Natural Language Processing VI