Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks

Authors

  • Lingran Song, University of Macau
  • Yucheng Zhou, University of Macau
  • Jianbing Shen, University of Macau

DOI:

https://doi.org/10.1609/aaai.v40i11.37864

Abstract

Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely address segmentation and diagnosis jointly. Yet it is crucial for patients that models provide explainable diagnoses along with segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries about medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, which contains diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chains-of-thought, created via an automated chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves diagnosis segmentation performance by leveraging the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To further improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms baselines in both segmentation and diagnosis.

Published

2026-03-14

How to Cite

Song, L., Zhou, Y., & Shen, J. (2026). Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9079–9087. https://doi.org/10.1609/aaai.v40i11.37864

Section

AAAI Technical Track on Computer Vision VIII