Lightweight Adaptive Topological Layout and Semantic Mapping in Vision-and-Language Navigation on Websites

Pingrui Lai; Zihao Xie; Hua Yang

doi:10.1609/aaai.v40i22.38901

Authors

Pingrui Lai, Shanghai Jiaotong University
Zihao Xie, Shanghai Jiaotong University
Hua Yang, Shanghai Jiaotong University


Abstract

Vision-and-Language navigation on websites requires agents to navigate to target webpages and answer questions based on human instructions. Current web agents primarily leverage Large Language Models (LLMs) for semantic understanding and reasoning, but still suffer from limited navigation performance and slow inference speed. Constructing a global map across webpages can enhance both navigation accuracy and efficiency; however, doing so is challenged by the open structure of web navigation graphs and the dynamic nature of web layouts. In this paper, we propose ATLAS (Adaptive Topological Layout And Semantic mapping), a framework that adaptively constructs a time-varying, unbounded topological map across webpages and unifies heterogeneous elements through semantic representation. This enables both global path planning and local element selection for web-based navigation and question answering. As a lightweight approach, ATLAS significantly outperforms existing state-of-the-art methods on the WebVLN benchmark with a 10% improvement in success rate, and achieves the highest average task success rate on both the Mind2Web and WebArena benchmarks.
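
To make the idea of a growing topological map with global path planning concrete, the following is a minimal illustrative sketch, not the ATLAS implementation. It assumes pages are identified by URL strings, that each clickable element has a text label, and that planning is plain breadth-first search; the class `TopologicalMap` and its methods `observe` and `plan` are hypothetical names introduced only for this example. The semantic representation of elements described in the abstract is not modeled here.

```python
from collections import deque


class TopologicalMap:
    """Hypothetical growing graph of webpages (illustration only)."""

    def __init__(self):
        # page URL -> {element label: destination URL}
        self.edges = {}

    def observe(self, page, links):
        """Record, or refresh, the outgoing links seen on a page.

        Overwriting the entry lets the map follow layout changes, and new
        destinations are added as nodes, so the graph has no fixed size.
        """
        self.edges[page] = dict(links)
        for dest in links.values():
            self.edges.setdefault(dest, {})

    def plan(self, start, goal):
        """Breadth-first search; returns the labels to click, or None."""
        queue = deque([(start, [])])
        visited = {start}
        while queue:
            page, path = queue.popleft()
            if page == goal:
                return path
            for label, dest in self.edges.get(page, {}).items():
                if dest not in visited:
                    visited.add(dest)
                    queue.append((dest, path + [label]))
        return None


web_map = TopologicalMap()
web_map.observe("/home", {"Shop": "/shop", "About": "/about"})
web_map.observe("/shop", {"Shoes": "/shop/shoes"})
route = web_map.plan("/home", "/shop/shoes")
print(route)  # ['Shop', 'Shoes']
```

In this sketch the global plan is a sequence of element labels, so choosing the next element on the current page reduces to looking up the first label in the route. A learned semantic matcher would replace that exact-label lookup in any realistic system.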
