Eriksson, Maria, Erasmo Purificato, Arman Noroozian, João Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation.” Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8, no. 1 (October 15, 2025): 850–864. Accessed May 29, 2026. https://ojs.aaai.org/index.php/AIES/article/view/36595.