What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

Authors

  • Lingfeng Zhang (Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Xiaomi EV)
  • Haoxiang Fu (National University of Singapore)
  • Xiaoshuai Hao (Xiaomi EV)
  • Shuyi Zhang (Institute of Automation, CAS)
  • Qiang Zhang (HKUSTGZ)
  • Rui Liu (Inner Mongolia University)
  • Long Chen (Xiaomi EV)
  • Wenbo Ding (Tsinghua Shenzhen International Graduate School, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v40i15.38258

Abstract

Embodied navigation is a fundamental capability that enables embodied agents to interact effectively with the physical world in various complex environments. However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding. To address this gap, we propose a new embodied navigation task called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents' navigation capabilities, enabling more effective interactions in real-world scenarios. To support this task, we generate a spatial navigation dataset of 10K trajectories within the simulator. The dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource for agent training and evaluation. Building on this dataset, we introduce SpNav, a hierarchical navigation framework. Specifically, SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify goal objects or areas within the observation range, then performs precise point-to-point navigation using a map, bridging the gap between perception and action so the agent can operate effectively in complex environments. Extensive experiments show that SpNav achieves state-of-the-art (SOTA) performance on spatial navigation tasks in both simulated and real-world environments, validating the effectiveness of our method.

Published

2026-03-14

How to Cite

Zhang, L., Fu, H., Hao, X., Zhang, S., Zhang, Q., Liu, R., … Ding, W. (2026). What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12627–12635. https://doi.org/10.1609/aaai.v40i15.38258

Section

AAAI Technical Track on Computer Vision XII