VPN: Visual Prompt Navigation

Shuo Feng; Zihan Wang; Yuchen Li; Rui Kong; Hengyi Cai; Shuaiqiang Wang; Gim Hee Lee; Piji Li; Shuqiang Jiang

doi:10.1609/aaai.v40i22.38888

Authors

Shuo Feng, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Zihan Wang, National University of Singapore
Yuchen Li, Baidu Inc.
Rui Kong, Baidu Inc.
Hengyi Cai, Baidu Inc.
Shuaiqiang Wang, Baidu Inc.
Gim Hee Lee, National University of Singapore
Piji Li, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Shuqiang Jiang, University of Chinese Academy of Sciences, Beijing

Abstract

While natural language is commonly used to guide embodied agents, its inherent ambiguity and verbosity often hinder language-guided navigation in complex environments. To address this, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents using only user-provided visual prompts on 2D top-view maps. A visual prompt primarily marks the intended navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. This form of guidance is friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings and construct two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network for the VPN tasks, together with two data augmentation strategies that enhance navigation performance: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes). Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation.
