VPN: Visual Prompt Navigation

Authors

  • Shuo Feng, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
  • Zihan Wang, National University of Singapore
  • Yuchen Li, Baidu Inc.
  • Rui Kong, Baidu Inc.
  • Hengyi Cai, Baidu Inc.
  • Shuaiqiang Wang, Baidu Inc.
  • Gim Hee Lee, National University of Singapore
  • Piji Li, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
  • Shuqiang Jiang, University of Chinese Academy of Sciences, Beijing

DOI:

https://doi.org/10.1609/aaai.v40i22.38888

Abstract

While natural language is commonly used to guide embodied agents, its inherent ambiguity and verbosity often hinder language-guided navigation in complex environments. To address this, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. The visual prompt primarily marks the navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network for the VPN tasks, together with two data augmentation strategies that enhance navigation performance: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes). Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation.
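
To make the idea of a visual prompt concrete, the following is a minimal sketch, not the authors' VPNet pipeline or dataset code. It assumes the top-view map is a small occupancy-style grid and that the trajectory is given as (row, column) waypoints. The names draw_trajectory and rotate_ccw, the grid size, and the waypoints are all hypothetical. The sketch marks a trajectory on the map and then produces four rotated copies, which is one possible reading of "altering prompt orientations" in the view-level augmentation described above.

```python
def draw_trajectory(top_view, waypoints, mark=1):
    """Return a copy of the grid with the waypoint polyline marked."""
    prompt = [row[:] for row in top_view]  # leave the input map untouched
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        # One step per cell along the longer axis keeps the line gap-free.
        steps = max(abs(r1 - r0), abs(c1 - c0))
        for t in range(steps + 1):
            frac = t / steps if steps else 0.0
            row = r0 + round((r1 - r0) * frac)
            col = c0 + round((c1 - c0) * frac)
            prompt[row][col] = mark
    return prompt


def rotate_ccw(grid, quarter_turns=1):
    """Rotate a grid counter-clockwise in 90-degree steps."""
    for _ in range(quarter_turns % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


SIZE = 8  # hypothetical map resolution
top_view = [[0] * SIZE for _ in range(SIZE)]  # placeholder blank map

# Hypothetical trajectory: move right along row 1, then down column 5.
prompt = draw_trajectory(top_view, [(1, 1), (1, 5), (6, 5)])

# Assumed form of orientation augmentation: the same prompt at 0/90/180/270 degrees.
views = [rotate_ccw(prompt, k) for k in range(4)]

for row in prompt:
    print("".join("#" if cell else "." for cell in row))
```

In this sketch every rotated view contains the same trajectory, only drawn in a different orientation. How the agent's initial heading is varied alongside the prompt is not specified in the abstract, so it is left out here.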

Published

2026-03-14

How to Cite

Feng, S., Wang, Z., Li, Y., Kong, R., Cai, H., Wang, S., … Jiang, S. (2026). VPN: Visual Prompt Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18253–18261. https://doi.org/10.1609/aaai.v40i22.38888

Issue

Vol. 40 No. 22 (2026)

Section

AAAI Technical Track on Intelligent Robotics