Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning
DOI: https://doi.org/10.1609/aaai.v40i33.40039
Abstract
Recent vision-language models (VLMs) show strong reasoning capabilities when trained with reinforcement learning from verifiable rewards (RLVR). Despite these capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, because verifiable reward data is not readily available in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.
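The abstract names influence function-based data filtering but does not give its details. As a rough illustration of the general idea only, the sketch below scores each training example by the dot product between its loss gradient and the mean validation gradient, a common first-order influence proxy, and keeps the examples with a positive score. Everything here is an assumption made for illustration: the toy logistic model, the function names (`grad_logloss`, `influence_scores`, `filter_by_influence`), and the zero threshold are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_logloss(w, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def influence_scores(w, train, val):
    """First-order influence proxy (an assumption, not the paper's estimator).

    A gradient step on a training example changes the validation loss by
    roughly -lr * (g_train . g_val), so a positive dot product suggests the
    example helps and a negative one suggests it hurts.
    """
    dim = len(w)
    g_val = [0.0] * dim
    for x, y in val:
        g = grad_logloss(w, x, y)
        for i in range(dim):
            g_val[i] += g[i] / len(val)
    return [
        sum(a * b for a, b in zip(grad_logloss(w, x, y), g_val))
        for x, y in train
    ]


def filter_by_influence(train, scores, threshold=0.0):
    """Keep training examples whose score exceeds the (hypothetical) threshold."""
    return [ex for ex, s in zip(train, scores) if s > threshold]


if __name__ == "__main__":
    w = [0.0, 0.0]
    val = [([1.0, 0.0], 1)]
    train = [
        ([1.0, 0.0], 1),  # agrees with the validation example
        ([1.0, 0.0], 0),  # conflicting label
        ([0.0, 1.0], 1),  # orthogonal feature, no first-order effect
    ]
    scores = influence_scores(w, train, val)
    print(scores)
    print(filter_by_influence(train, scores))
```

In a multi-round setting, one could recompute such scores after each training round with the updated weights and re-filter the pool, though how Vision-G1 actually schedules its rounds is not specified in the abstract.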
Published
2026-03-14
How to Cite
Zha, Y., Zhou, K., Wu, Y., Wang, Y., Feng, J., Xu, Z., … Hu, Z. (2026). Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28131–28139. https://doi.org/10.1609/aaai.v40i33.40039
Section
AAAI Technical Track on Machine Learning X