Wu, Di, et al. “Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding”. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 40, Mar. 2026, pp. 33899-07, doi:10.1609/aaai.v40i40.40682.