PagePilot for Web Automation Based on a Multi-Agent Architecture
DOI:
https://doi.org/10.1609/icwsm.v20i1.42771
Abstract
Recent advances in reasoning and multi-modal analysis have enabled large language models (LLMs) to automate web navigation tasks such as extracting data and filling out forms. However, existing systems struggle with complex, long-form web interactions, such as parsing lengthy articles or optimizing multi-step workflows, because of limitations in processing dynamic content and in architectural scalability. To address this gap, we introduce PagePilot, an automated web control system that integrates DOM source code analysis and multi-agent collaboration for robust task execution. Building on WebVoyager, PagePilot incorporates dynamic loading to handle infinite-scroll pages and asynchronous content, alongside a hierarchical reasoning framework for efficient decision-making. Evaluations demonstrate state-of-the-art performance: a 76% task success rate on WebVoyager (vs. 63% baseline) and 47% on GAIA (vs. 38% for GPT-4). To validate generalization, we curated benchmarks from Mind2Web and a novel Chinese web dataset, achieving 52% and 70% success rates, respectively, and outperforming prior methods in linguistically diverse scenarios. Ablation studies confirm critical design choices, showing a 22% increase in task completion and a 27% reduction in redundant actions compared to fragmentary approaches.
Published
2026-05-25
How to Cite
Yeh, C.-J., & Chang, C.-H. (2026). PagePilot for Web Automation Based on a Multi-Agent Architecture. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 2623–2636. https://doi.org/10.1609/icwsm.v20i1.42771
Section
Full Papers