GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

Authors

  • Tianbin Li, Shanghai Artificial Intelligence Laboratory
  • Yanzhou Su, Shanghai Artificial Intelligence Laboratory
  • Wei Li, Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
  • Bin Fu, Shanghai Artificial Intelligence Laboratory
  • Zhe Chen, Shanghai Artificial Intelligence Laboratory
  • Ziyan Huang, Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
  • Guoan Wang, Shanghai Artificial Intelligence Laboratory
  • Chenglong Ma, Shanghai Artificial Intelligence Laboratory
  • Ying Chen, Shanghai Artificial Intelligence Laboratory
  • Ming Hu, Shanghai Artificial Intelligence Laboratory
  • Yanjun Li, Shanghai Artificial Intelligence Laboratory
  • Pengcheng Chen, Shanghai Artificial Intelligence Laboratory
  • Shixiang Tang, Shanghai Artificial Intelligence Laboratory
  • Xiaowei Hu, Shanghai Artificial Intelligence Laboratory
  • Zhongying Deng, Shanghai Artificial Intelligence Laboratory
  • Yuanfeng Ji, Stanford University
  • Jin Ye, Shanghai Artificial Intelligence Laboratory
  • Yu Qiao, Shanghai Artificial Intelligence Laboratory
  • Junjun He, Shanghai Artificial Intelligence Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i28.39485

Abstract

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

Published

2026-03-14

How to Cite

Li, T., Su, Y., Li, W., Fu, B., Chen, Z., Huang, Z., … He, J. (2026). GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI. Proceedings of the AAAI Conference on Artificial Intelligence, 40(28), 23177–23185. https://doi.org/10.1609/aaai.v40i28.39485

Section

AAAI Technical Track on Machine Learning V