Li, X., Lan, Y., & Yang, C. (2025). TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24485–24493. https://doi.org/10.1609/aaai.v39i23.34627