Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Authors

  • Hai-Ming Xu, Australian Institute for Machine Learning, The University of Adelaide
  • Qi Chen, Australian Institute for Machine Learning, The University of Adelaide
  • Lei Wang, University of Wollongong
  • Lingqiao Liu, Australian Institute for Machine Learning, The University of Adelaide

DOI:

https://doi.org/10.1609/aaai.v39i8.32957

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding—accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.
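The core idea described above — selecting attention maps for particular query-prompt tokens, aggregating them, and reading off the most-attended image location — can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, mean aggregation, and patch-grid layout are illustrative assumptions:

```python
import numpy as np

def attention_grounding(attn_maps, token_indices, patch_grid):
    """Hypothetical sketch of attention-driven grounding: aggregate the
    attention maps of selected query tokens over image patches and return
    the normalized (x, y) center of the most-attended patch."""
    # attn_maps: (num_tokens, num_patches) attention weights over image patches
    selected = attn_maps[token_indices]   # maps for the query's key tokens
    agg = selected.mean(axis=0)           # simple mean aggregation (assumption)
    agg = agg / agg.sum()                 # renormalize to a distribution
    h, w = patch_grid                     # patches arranged row-major as h x w
    idx = int(np.argmax(agg))             # most-attended patch index
    row, col = divmod(idx, w)
    # convert the patch index to normalized image coordinates
    x = (col + 0.5) / w
    y = (row + 0.5) / h
    return x, y
```

In practice the attention maps would come from the pretrained MLLM's cross-modal attention layers, and aggregation across heads and layers is a design choice the paper explores; the sketch only shows the token-selection-then-aggregate structure.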

Published

2025-04-11

How to Cite

Xu, H.-M., Chen, Q., Wang, L., & Liu, L. (2025). Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8851–8859. https://doi.org/10.1609/aaai.v39i8.32957

Issue

Section

AAAI Technical Track on Computer Vision VII