VPN: Visual Prompt Navigation
DOI:
https://doi.org/10.1609/aaai.v40i22.38888
Abstract
While natural language is commonly used to guide embodied agents, its inherent ambiguity and verbosity often hinder language-guided navigation in complex environments. To address this, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents using only user-provided visual prompts on 2D top-view maps. The visual prompt primarily marks the navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network for the VPN tasks, together with two data augmentation strategies that enhance navigation performance: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes). Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation.
Published
2026-03-14
How to Cite
Feng, S., Wang, Z., Li, Y., Kong, R., Cai, H., Wang, S., … Jiang, S. (2026). VPN: Visual Prompt Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18253–18261. https://doi.org/10.1609/aaai.v40i22.38888
Section
AAAI Technical Track on Intelligent Robotics