Lightweight Adaptive Topological Layout and Semantic Mapping in Vision-and-Language Navigation on Websites

Authors

  • Pingrui Lai Shanghai Jiaotong University
  • Zihao Xie Shanghai Jiaotong University
  • Hua Yang Shanghai Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i22.38901

Abstract

Vision-and-Language navigation on websites requires agents to navigate to target webpages and answer questions based on human instructions. Current web agents primarily leverage Large Language Models (LLMs) for semantic understanding and reasoning, but still suffer from limited navigation performance and slow inference speed. Constructing a global map across webpages can effectively enhance both navigation accuracy and efficiency; however, this is challenged by the open structure of web navigation graphs and the dynamic nature of web layouts. In this paper, we propose ATLAS: Adaptive Topological Layout And Semantic mapping, a framework that adaptively constructs a time-varying, unbounded topological map across webpages and unifies heterogeneous elements through semantic representation. This enables both global path planning and local element selection for web-based navigation and question answering. As a lightweight approach, ATLAS significantly outperforms existing state-of-the-art methods on the WebVLN benchmark with a 10% improvement in success rate, and achieves the highest average task success rate on both the Mind2Web and WebArena benchmarks.

Published

2026-03-14

How to Cite

Lai, P., Xie, Z., & Yang, H. (2026). Lightweight Adaptive Topological Layout and Semantic Mapping in Vision-and-Language Navigation on Websites. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18370–18378. https://doi.org/10.1609/aaai.v40i22.38901

Section

AAAI Technical Track on Intelligent Robotics