[1] X. Li, X. Li, S. Hu, Y. Guo, and W. Zhang, “VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains”, AAAI, vol. 40, no. 38, pp. 31796–31804, Mar. 2026.