Li, X., Li, X., Hu, S., Guo, Y., & Zhang, W. (2026). VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 31796–31804. https://doi.org/10.1609/aaai.v40i38.40448