MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

Authors

  • Ronghao Xu, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, P.R. China
  • Zhen Huang, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, P.R. China; School of Information Science and Technology, Eastern Institute of Technology, Ningbo 315200, P.R. China
  • Yangbo Wei, School of Information Science and Technology, Eastern Institute of Technology, Ningbo 315200, P.R. China; Shanghai Jiao Tong University, Shanghai 200030, P.R. China
  • Xiaoqian Zhou, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, P.R. China
  • Zikang Xu, Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230026, P.R. China
  • Ting Liu, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, P.R. China
  • Zihang Jiang, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, P.R. China
  • S. Kevin Zhou, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, P.R. China

DOI:

https://doi.org/10.1609/aaai.v40i44.41142

Abstract

Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models that can adapt to diverse real-world scenarios and perform complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks: they do not integrate multiple medical imaging modalities and fail to capture the longitudinal, interactive nature of clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-round visual question answering (VQA), joint reasoning over multiple medical imaging modalities, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-round VQA, closed-ended multi-round VQA, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Stage Chain Accuracy (SCA) and Error Propagation Suppression Coefficient (EPSC). Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.

Published

2026-03-14

How to Cite

Xu, R., Huang, Z., Wei, Y., Zhou, X., Xu, Z., Liu, T., … Zhou, S. K. (2026). MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 38048–38056. https://doi.org/10.1609/aaai.v40i44.41142

Issue

Section

AAAI Special Track on AI Alignment