InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Authors

  • Yuhang Liu (Zhejiang University, Hangzhou, Zhejiang, China; InfiX.ai, Hong Kong, China)
  • Zeyu Liu (The Hong Kong Polytechnic University, Hong Kong, China)
  • Shuanghe Zhu (Zhejiang University, Hangzhou, Zhejiang, China)
  • Pengxiang Li (The Hong Kong Polytechnic University, Hong Kong, China)
  • Congkai Xie (InfiX.ai, Hong Kong, China)
  • Jiasheng Wang (The University of Chicago, Chicago, IL, USA; InfiX.ai, Hong Kong, China)
  • Xueyu Hu (Zhejiang University, Hangzhou, Zhejiang, China)
  • Xiaotian Han (Independent Researcher)
  • Jianbo Yuan (Amazon, Seattle, WA, USA)
  • Xinyao Wang (Amazon, Seattle, WA, USA)
  • Shengyu Zhang (Zhejiang University, Hangzhou, Zhejiang, China)
  • Hongxia Yang (The Hong Kong Polytechnic University, Hong Kong, China; InfiX.ai, Hong Kong, China)
  • Fei Wu (Zhejiang University, Hangzhou, Zhejiang, China)

DOI:

https://doi.org/10.1609/aaai.v40i38.40500

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge for such agents is robustly grounding natural language instructions. This requires precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, preventing models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from the first principle of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving relative improvements of up to 9.0% over the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding.
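To make the efficiency principle η = U/C concrete, below is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual AEPO formulation: we assume the utility U is a binary hit indicator (1 if any proposed click point lands inside the target element's bounding box, in the verifiable-reward spirit of RLVR) and the cost C is the number of answers the model proposed. All function names and the exact definitions of U and C are hypothetical.

    # Sketch of an efficiency-style reward eta = U / C for multi-answer
    # GUI grounding. ASSUMPTIONS (not from the paper): U is a binary hit
    # indicator over the proposed points; C is the count of proposals, so
    # spraying many guesses earns less than one confident, correct answer.

    from typing import List, Tuple

    Point = Tuple[float, float]
    Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    def inside(p: Point, box: Box) -> bool:
        # True if the click point falls within the target bounding box.
        x, y = p
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    def exploration_reward(answers: List[Point], target: Box) -> float:
        # eta = U / C: utility of the answer set divided by its cost.
        if not answers:
            return 0.0
        utility = 1.0 if any(inside(p, target) for p in answers) else 0.0
        cost = float(len(answers))
        return utility / cost

    # Example: two proposals, one of which hits the target box -> eta = 0.5.
    print(exploration_reward([(10, 10), (52, 48)], (40, 40, 60, 60)))

Under these assumptions the reward still verifies correctness with a simple rule, but dividing by the answer count adapts it to how much exploration the model spent, which is the trade-off the abstract's multi-answer generation strategy is meant to balance.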

Published

2026-03-14

How to Cite

Liu, Y., Liu, Z., Zhu, S., Li, P., Xie, C., Wang, J., Hu, X., Han, X., Yuan, J., Wang, X., Zhang, S., Yang, H., & Wu, F. (2026). InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32267-32275. https://doi.org/10.1609/aaai.v40i38.40500

Section

AAAI Technical Track on Natural Language Processing III