Liang, Z., Yu, L., Shiyu, Z., Ye, Q., & Hu, H. (2026). How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation Under the One-Time-Pad-Based Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37636–37644. https://doi.org/10.1609/aaai.v40i44.41098