[1] D. Wu, “Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding,” AAAI, vol. 40, no. 40, pp. 33899–33907, Mar. 2026.