Zhou, Hongli, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, et al. 2026. “Lost in Benchmarks? Rethinking Large Language Model Benchmarking With Item Response Theory.” Proceedings of the AAAI Conference on Artificial Intelligence 40 (41): 35085–93. https://doi.org/10.1609/aaai.v40i41.40814.