What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

Authors

  • Lingfeng Zhang (Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Xiaomi EV)
  • Haoxiang Fu (National University of Singapore)
  • Xiaoshuai Hao (Xiaomi EV)
  • Shuyi Zhang (Institute of Automation, CAS)
  • Qiang Zhang (HKUSTGZ)
  • Rui Liu (Inner Mongolia University)
  • Long Chen (Xiaomi EV)
  • Wenbo Ding (Tsinghua Shenzhen International Graduate School, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v40i15.38258

Abstract

Embodied navigation is a fundamental capability that enables embodied agents to interact effectively with the physical world in various complex environments. However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding. To address this gap, we propose a new embodied navigation task called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents' navigation capabilities, enabling more effective interactions in real-world scenarios. To support this task, we generate a spatial navigation dataset of 10K trajectories within the simulator. The dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource for agent training and evaluation. Building on this dataset, we introduce SpNav, a hierarchical navigation framework. Specifically, SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify goal objects or areas within the observation range, then performs precise point-to-point navigation using a map, bridging the gap between perception and action so the agent can operate effectively in complex environments. Extensive experiments show that SpNav achieves state-of-the-art (SOTA) performance on spatial navigation tasks in both simulated and real-world environments, validating the effectiveness of our method.

Published

2026-03-14

How to Cite

Zhang, L., Fu, H., Hao, X., Zhang, S., Zhang, Q., Liu, R., … Ding, W. (2026). What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12627–12635. https://doi.org/10.1609/aaai.v40i15.38258

Section

AAAI Technical Track on Computer Vision XII