Wu, Di, Liting Jiang, Ruiyu Fang, Bianjing, Hongyan Xie, Haoxiang Su, Hao Huang, Zhongjiang He, Shuangyong Song, and Xuelong Li. “Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding.” Proceedings of the AAAI Conference on Artificial Intelligence 40, no. 40 (March 14, 2026): 33899–33907. Accessed May 14, 2026. https://ojs.aaai.org/index.php/AAAI/article/view/40682.