GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

Tianbin Li; Yanzhou Su; Wei Li; Bin Fu; Zhe Chen; Ziyan Huang; Guoan Wang; Chenglong Ma; Ying Chen; Ming Hu; Yanjun Li; Pengcheng Chen; Shixiang Tang; Xiaowei Hu; Zhongying Deng; Yuanfeng Ji; Jin Ye; Yu Qiao; Junjun He

doi:10.1609/aaai.v40i28.39485

Authors

Tianbin Li, Shanghai Artificial Intelligence Laboratory
Yanzhou Su, Shanghai Artificial Intelligence Laboratory
Wei Li, Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
Bin Fu, Shanghai Artificial Intelligence Laboratory
Zhe Chen, Shanghai Artificial Intelligence Laboratory
Ziyan Huang, Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
Guoan Wang, Shanghai Artificial Intelligence Laboratory
Chenglong Ma, Shanghai Artificial Intelligence Laboratory
Ying Chen, Shanghai Artificial Intelligence Laboratory
Ming Hu, Shanghai Artificial Intelligence Laboratory
Yanjun Li, Shanghai Artificial Intelligence Laboratory
Pengcheng Chen, Shanghai Artificial Intelligence Laboratory
Shixiang Tang, Shanghai Artificial Intelligence Laboratory
Xiaowei Hu, Shanghai Artificial Intelligence Laboratory
Zhongying Deng, Shanghai Artificial Intelligence Laboratory
Yuanfeng Ji, Stanford University
Jin Ye, Shanghai Artificial Intelligence Laboratory
Yu Qiao, Shanghai Artificial Intelligence Laboratory
Junjun He, Shanghai Artificial Intelligence Laboratory

Abstract

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we construct GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.
