Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding

Authors

  • Di Wu, School of Computer Science and Technology, Xinjiang University, Urumqi, China; Institute of Artificial Intelligence of China Telecom (TeleAI), Beijing, China
  • Liting Jiang, School of Computer Science and Technology, Xinjiang University, Urumqi, China
  • Ruiyu Fang, Institute of Artificial Intelligence of China Telecom (TeleAI), Beijing, China
  • Bianjing, School of Computer Science and Technology, Xinjiang University, Urumqi, China
  • Hongyan Xie, School of Computer, Beijing University of Aeronautics and Astronautics, Beijing, China
  • Haoxiang Su, School of Computer Science and Technology, Xinjiang University, Urumqi, China
  • Hao Huang, School of Computer Science and Technology, Xinjiang University, Urumqi, China; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Urumqi, China
  • Zhongjiang He, Institute of Artificial Intelligence of China Telecom (TeleAI), Beijing, China
  • Shuangyong Song, Institute of Artificial Intelligence of China Telecom (TeleAI), Beijing, China
  • Xuelong Li, Institute of Artificial Intelligence of China Telecom (TeleAI), Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i40.40682

Abstract

Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short of representing real-world scenarios. Specifically, (1) CA is represented with one-hot vectors, which is overly idealized, and (2) models typically focus solely on predicting intent and slot labels, neglecting the reasoning process that could improve both performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. To address the over-idealized CA representation, we use GPT-4o and FLUX.1-dev to generate images reflecting users’ environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for the predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instruction template, LR-Instruct, which first predicts labels and then generates the corresponding reasoning; this two-step approach mitigates the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning for advancing SLU.
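The "label-first, then reasoning" ordering described for LR-Instruct can be illustrated with a minimal sketch. The function name, field labels, and wording below are illustrative assumptions, not the authors' actual template; the only point carried over from the abstract is that the label fields are requested before the reasoning field.

```python
def build_lr_instruct_prompt(utterance: str, image_caption: str) -> str:
    """Assemble a two-step prompt in the spirit of LR-Instruct: the model
    is asked to output intent and slot labels first, and only afterwards
    the reasoning, so label prediction is not conditioned on generated
    reasoning text. All field names here are hypothetical."""
    return (
        "You are a spoken language understanding assistant.\n"
        f"Visual scene: {image_caption}\n"
        f"Utterance: {utterance}\n"
        "Step 1: Output the intent label and the slot labels.\n"
        "Step 2: Then explain the reasoning behind these labels.\n"
        "Answer format:\n"
        "Intent: ...\n"
        "Slots: ...\n"
        "Reasoning: ...\n"
    )

prompt = build_lr_instruct_prompt(
    utterance="Play that song again",
    image_caption="A user sitting in a car with a music app open",
)
# The label fields precede the reasoning field in the requested output.
assert prompt.index("Intent:") < prompt.index("Reasoning:")
```

The design choice mirrored here is the ordering constraint: because autoregressive decoding conditions later tokens on earlier ones, emitting labels before reasoning prevents a flawed explanation from steering the label prediction.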

Published

2026-03-14

How to Cite

Wu, D., Jiang, L., Fang, R., Bianjing, Xie, H., Su, H., … Li, X. (2026). Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33899–33907. https://doi.org/10.1609/aaai.v40i40.40682

Section

AAAI Technical Track on Natural Language Processing V