GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Authors

  • Chen Chen, University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI), China Telecom; Shanghai Innovation Institute
  • Jiawei Shao, Institute of Artificial Intelligence (TeleAI), China Telecom
  • Dakuan Lu, Institute of Artificial Intelligence (TeleAI), China Telecom
  • Haoyi Hu, Shanghai Jiao Tong University
  • Xiangcheng Liu, University of Science and Technology of China; Shanghai Innovation Institute
  • Hantao Yao, University of Science and Technology of China
  • Wu Liu, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i35.40175

Abstract

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes the decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.
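The abstract describes a spatially continuous reward that combines location proximity with region overlap to densify supervision. The paper's exact formulation is not given here, so the sketch below is only a plausible illustration of that idea: a hypothetical `grounding_reward` blending a Gaussian center-distance term (nonzero even when boxes do not overlap, which addresses reward sparsity) with IoU. The weights `alpha` and bandwidth `sigma` are assumed, not from the paper.

```python
import math

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gt_box, sigma=0.1, alpha=0.5):
    """Hypothetical dense reward: weighted sum of a center-proximity term
    and a region-overlap (IoU) term. Coordinates assumed normalized to [0, 1].
    sigma and alpha are illustrative hyperparameters, not the paper's values."""
    cx_p = (pred_box[0] + pred_box[2]) / 2
    cy_p = (pred_box[1] + pred_box[3]) / 2
    cx_g = (gt_box[0] + gt_box[2]) / 2
    cy_g = (gt_box[1] + gt_box[3]) / 2
    dist = math.hypot(cx_p - cx_g, cy_p - cy_g)
    # Gaussian proximity: smooth gradient toward the target even with zero overlap.
    proximity = math.exp(-(dist ** 2) / (2 * sigma ** 2))
    return alpha * proximity + (1 - alpha) * iou(pred_box, gt_box)
```

A perfectly matching prediction scores 1.0, while a disjoint prediction still receives a small but informative proximity signal, which is the property that distinguishes this style of reward from a sparse hit/miss signal.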

Published

2026-03-14

How to Cite

Chen, C., Shao, J., Lu, D., Hu, H., Liu, X., Yao, H., & Liu, W. (2026). GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(35), 29350-29358. https://doi.org/10.1609/aaai.v40i35.40175

Section

AAAI Technical Track on Multiagent Systems