Liang, Z. (2026) “How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation Under the One-Time-Pad-Based Framework”, Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), pp. 37636–37644. doi: 10.1609/aaai.v40i44.41098.