[1] H. Zhou, “Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory”, AAAI, vol. 40, no. 41, pp. 35085–35093, Mar. 2026.