UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu; Yuxiang Chai; Yaxuan Guo; Xi Yin; Liang Liu; Hao Wang; Han Xiao; Shuai Ren; Pengxiang Zhao; Guangyi Liu; Guanjing Xiong; Hongsheng Li

doi:10.1609/aaai.v40i21.38816

Authors

Zhengxi Lu, Zhejiang University; vivo AI Lab
Yuxiang Chai, MMLab @ CUHK
Yaxuan Guo, vivo AI Lab
Xi Yin, vivo AI Lab
Liang Liu, vivo AI Lab
Hao Wang, vivo AI Lab
Han Xiao, MMLab @ CUHK
Shuai Ren, vivo AI Lab
Pengxiang Zhao, Zhejiang University
Guangyi Liu, Zhejiang University
Guanjing Xiong, vivo AI Lab
Hongsheng Li, MMLab @ CUHK

DOI:

https://doi.org/10.1609/aaai.v40i21.38816

Abstract

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in large language models (LLMs) through reinforcement learning (RL) with rule-based rewards. Despite the success of this approach in language tasks, its application to multimodal domains, particularly graphical user interface (GUI) agent tasks, remains under-explored. To address this gap, we propose UI-R1, the first framework to investigate how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. UI-R1 introduces a novel rule-based action reward scheme, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). To further improve efficiency at inference time, we present UI-R1-Efficient, a two-stage training paradigm that reduces reasoning length while boosting overall performance. In addition, we construct a compact yet high-quality dataset containing 2K challenging tasks across five prevalent mobile device action types. Experiments show that our proposed models (e.g., UI-R1-3B) achieve substantial improvements over the base model (Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 18.3% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 10.9% on AndroidControl. Moreover, our efficient versions deliver competitive performance compared to considerably larger state-of-the-art models, underscoring the potential of reinforcement learning to advance GUI control and paving the way for future research in Human-Computer Interaction (HCI).
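As a rough illustration of how a rule-based reward can drive GRPO-style optimization, the sketch below scores a group of sampled action predictions and normalizes the scores within the group. The abstract does not spell out the reward rules, so `toy_action_reward` (action-type match plus a click-inside-box check) and all names here are hypothetical stand-ins, not the UI-R1 reward scheme. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) reflects the general GRPO idea.

```python
from statistics import mean, pstdev


def toy_action_reward(pred_type, pred_point, gold_type, gold_box):
    """Hypothetical rule-based reward (an assumption, not the paper's scheme).

    Gives 1.0 for a matching action type, plus 1.0 if the predicted
    point falls inside the ground-truth bounding box (x1, y1, x2, y2).
    """
    reward = 0.0
    if pred_type == gold_type:
        reward += 1.0
        x, y = pred_point
        x1, y1, x2, y2 = gold_box
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 1.0
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps);
    eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One task, four sampled predictions (illustrative values only).
gold_type, gold_box = "click", (0, 0, 10, 10)
samples = [
    ("click", (5, 5)),    # right type, point inside the box
    ("scroll", (5, 5)),   # wrong type
    ("click", (50, 5)),   # right type, point outside the box
    ("click", (40, 40)),  # right type, point outside the box
]
rewards = [toy_action_reward(t, p, gold_type, gold_box) for t, p in samples]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples for the same task, no learned value model is needed; predictions that beat the group average are reinforced and the rest are discouraged.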
