WebVLN: Vision-and-Language Navigation on Websites

Authors

  • Qi Chen The University of Adelaide
  • Dileepa Pitawela The University of Adelaide
  • Chongyang Zhao The University of Adelaide
  • Gengze Zhou The University of Adelaide
  • Hsiang-Ting Chen The University of Adelaide
  • Qi Wu The University of Adelaide

DOI:

https://doi.org/10.1609/aaai.v38i2.27878

Keywords:

CV: Language and Vision

Abstract

The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task, which only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.

Published

2024-03-24

How to Cite

Chen, Q., Pitawela, D., Zhao, C., Zhou, G., Chen, H.-T., & Wu, Q. (2024). WebVLN: Vision-and-Language Navigation on Websites. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1165–1173. https://doi.org/10.1609/aaai.v38i2.27878

Issue

Section

AAAI Technical Track on Computer Vision I