Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

Authors

  • Yuheng Zha University of California, San Diego
  • Kun Zhou University of California, San Diego
  • Yujia Wu University of California, San Diego
  • Yushu Wang University of California, San Diego
  • Jie Feng University of California, San Diego
  • Zhi Xu University of California, San Diego
  • Shibo Hao University of California, San Diego
  • Zhengzhong Liu Mohamed bin Zayed University of Artificial Intelligence
  • Eric P. Xing Carnegie Mellon University, Mohamed bin Zayed University of AI
  • Zhiting Hu University of California, San Diego

DOI:

https://doi.org/10.1609/aaai.v40i33.40039

Abstract

Recent vision-language models (VLMs) show strong reasoning capabilities through training with reinforcement learning from verifiable rewards (RLVR). Despite their impressive capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, due to the lack of readily available verifiable reward data in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.
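The abstract names an influence function-based data filtering strategy without giving its details. As an illustration of the general idea only, the following minimal sketch scores each training example by a first-order influence proxy: the dot product between its loss gradient and the mean validation gradient. This is an assumption-laden toy, not the authors' implementation: the logistic-regression model, the function names (grad_logloss, influence_scores, filter_by_influence), the threshold, and the omission of the inverse-Hessian term used in classical influence functions are all choices made here for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_logloss(w, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def influence_scores(w, train, val):
    """First-order influence proxy (hypothetical helper, not from the paper).

    Score = dot(grad of training example, mean grad over validation set).
    A positive score means a small gradient step on that training example
    is expected to lower the validation loss. The inverse-Hessian factor
    of classical influence functions is dropped (treated as identity).
    """
    g_val = [0.0] * len(w)
    for x, y in val:
        for i, g in enumerate(grad_logloss(w, x, y)):
            g_val[i] += g / len(val)
    return [
        sum(a * b for a, b in zip(grad_logloss(w, x, y), g_val))
        for x, y in train
    ]


def filter_by_influence(w, train, val, threshold=0.0):
    """Keep training examples whose proxy score exceeds the threshold."""
    scores = influence_scores(w, train, val)
    return [ex for ex, s in zip(train, scores) if s > threshold]


# Toy usage: one validation example and two training examples that share
# the same features but carry opposite labels.
w = [0.0, 0.0]
val = [([1.0, 0.0], 1)]
train = [([1.0, 0.0], 1), ([1.0, 0.0], 0)]
kept = filter_by_influence(w, train, val)
print(kept)  # [([1.0, 0.0], 1)]
```

In this toy case the training example whose label agrees with the validation example receives a positive score and is kept, while the conflicting one receives a negative score and is dropped. A practical pipeline for a VLM would have to approximate per-example gradients at far larger scale, which this sketch does not attempt.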

Published

2026-03-14

How to Cite

Zha, Y., Zhou, K., Wu, Y., Wang, Y., Feng, J., Xu, Z., … Hu, Z. (2026). Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28131–28139. https://doi.org/10.1609/aaai.v40i33.40039

Section

AAAI Technical Track on Machine Learning X