Liang, Zi, et al. “How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation Under the One-Time-Pad-Based Framework.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 44, Mar. 2026, pp. 37636-44, doi:10.1609/aaai.v40i44.41098.