Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

Yuheng Zha; Kun Zhou; Yujia Wu; Yushu Wang; Jie Feng; Zhi Xu; Shibo Hao; Zhengzhong Liu; Eric P. Xing; Zhiting Hu

doi:10.1609/aaai.v40i33.40039

Authors

Yuheng Zha University of California, San Diego
Kun Zhou University of California, San Diego
Yujia Wu University of California, San Diego
Yushu Wang University of California, San Diego
Jie Feng University of California, San Diego
Zhi Xu University of California, San Diego
Shibo Hao University of California, San Diego
Zhengzhong Liu Mohamed bin Zayed University of Artificial Intelligence
Eric P. Xing Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence
Zhiting Hu University of California, San Diego

Abstract

Recent vision-language models (VLMs) show strong reasoning capabilities through training with reinforcement learning from verifiable rewards (RLVR). Despite their impressive capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, due to the lack of readily available verifiable reward data in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.
