Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

Authors

  • Xuesong Zhang, Hefei University of Technology
  • Yunbo Xu, Hefei University of Technology
  • Jia Li, Hefei University of Technology
  • Ruonan Liu, Shanghai Jiao Tong University
  • Zhenzhen Hu, Hefei University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i22.38948

Abstract

Navigating unseen environments from natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). Humans intuitively ground concrete semantic knowledge within spatial layouts during indoor navigation. Previous studies have introduced diverse environmental representations to enhance reasoning, but these co-occurring modalities are often naively concatenated with RGB features, which underuses each modality's distinct contribution. Inspired by how humans navigate, we propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture that enables agents to perceive and ground environments at diverse scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, thereby capturing fine-grained environmental semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth-enhanced Spatial Perception (DSP) module incrementally constructs a trajectory-level depth exploration map, giving the agent a coarse-grained understanding of the global spatial layout. Extensive experiments demonstrate that SUSA's hierarchical representation enrichment not only boosts the navigation performance of the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON), but also exhibits superior generalization to the continuous R2R-CE.

Published

2026-03-14

How to Cite

Zhang, X., Xu, Y., Li, J., Liu, R., & Hu, Z. (2026). Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18791–18799. https://doi.org/10.1609/aaai.v40i22.38948

Section

AAAI Technical Track on Intelligent Robotics