Enhancing Spatial Reasoning Through Visual and Textual Thinking

Authors

  • Xun Liang State Key Lab of CAD&CG, Zhejiang University; Alibaba Cloud Computing
  • Xin Guo Alibaba Cloud Computing
  • Zhongming Jin Alibaba Cloud Computing
  • Weihang Pan School of Software Technology, Zhejiang University
  • Penghui Shang Xihu, Hangzhou Zhiyuan Research Institute Co., Ltd
  • Deng Cai State Key Lab of CAD&CG, Zhejiang University
  • Binbin Lin School of Software Technology, Zhejiang University
  • Jieping Ye Alibaba Cloud Computing

DOI:

https://doi.org/10.1609/aaai.v40i28.39514

Abstract

The spatial reasoning task aims to reason about spatial relationships in 2D and 3D space, a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they still struggle with spatial reasoning. In this paper, we introduce a method that enhances Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to automatically generate location-related specific tokens for important targets. It addresses not only the objects mentioned in the problem but also other objects potentially related to the reasoning. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues and gradually infers the answers to spatial reasoning problems. To support the model's training effectively, we manually corrected the existing spatial reasoning dataset, eliminating numerous incorrect labels produced by automatic annotation; we also restructured the data input format to enhance generalization and developed a reasoning framework for model thinking. Without introducing any additional information (such as masks or depth), our model's overall average performance on several spatial understanding tasks improves significantly over that of other models.

Published

2026-03-14

How to Cite

Liang, X., Guo, X., Jin, Z., Pan, W., Shang, P., Cai, D., … Ye, J. (2026). Enhancing Spatial Reasoning Through Visual and Textual Thinking. Proceedings of the AAAI Conference on Artificial Intelligence, 40(28), 23433–23441. https://doi.org/10.1609/aaai.v40i28.39514

Section

AAAI Technical Track on Machine Learning V