Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification

Authors

  • Yingfan Ma: Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
  • Xiaoyuan Luo: Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
  • Kexue Fu: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, China
  • Manning Wang: Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China

DOI:

https://doi.org/10.1609/aaai.v38i13.29338

Keywords:

ML: Multi-instance/Multi-view Learning, CV: Medical and Biological Imaging

Abstract

Pathological images play a vital role in clinical cancer diagnosis, and computer-aided diagnosis on digital Whole Slide Images (WSIs) has been widely studied. The major challenge in applying deep learning models to WSI analysis is the huge size of WSIs, which forces existing methods to trade off end-to-end learning against proper modeling of contextual information. Most state-of-the-art methods use a two-stage strategy: a pre-trained model extracts features from small patches cut from a WSI, and these features are then fed into a classification model. Such methods cannot perform end-to-end learning and consider contextual information at the same time. To solve this problem, we propose a framework that models a WSI as a video of a pathologist's observation and uses a Transformer to process video clips with a divide-and-conquer strategy, which achieves both context-awareness and end-to-end learning. Extensive experiments on three public WSI datasets show that our proposed method outperforms existing state-of-the-art methods in both WSI classification and positive region detection.
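
For readers unfamiliar with the two-stage strategy mentioned in the abstract, the following is a minimal sketch of that baseline pipeline, not the method proposed in the paper. It assumes a frozen feature extractor (replaced here by a fixed random projection) and attention-based pooling as the classification model; all function names, shapes, and weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def extract_features(patches, proj):
    """Stage 1 (stand-in): frozen 'pre-trained' extractor.
    Assumption: a fixed linear projection + ReLU replaces a real CNN."""
    flat = patches.reshape(patches.shape[0], -1)   # (n_patches, H*W*C)
    return np.maximum(flat @ proj, 0.0)            # (n_patches, d)

def attention_pool(features, v, w):
    """Stage 2a: attention-based pooling over patch features."""
    scores = np.tanh(features @ v) @ w             # (n_patches,)
    scores = scores - scores.max()                 # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features, weights             # slide feature, per-patch weights

def classify_slide(patches, proj, v, w, clf_w, clf_b):
    """Stage 2b: slide-level probability from the pooled feature."""
    feats = extract_features(patches, proj)
    slide_feat, weights = attention_pool(feats, v, w)
    logit = slide_feat @ clf_w + clf_b
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, weights

# Toy "slide": 12 patches of 8x8 RGB pixels (random data, illustration only).
rng = np.random.default_rng(0)
patches = rng.random((12, 8, 8, 3))
proj = rng.normal(size=(8 * 8 * 3, 16)) * 0.1      # frozen stage-1 weights
v = rng.normal(size=(16, 8))                       # attention parameters
w = rng.normal(size=8)
clf_w = rng.normal(size=16)                        # classifier parameters
clf_b = 0.0

prob, weights = classify_slide(patches, proj, v, w, clf_w, clf_b)
print("slide-level probability:", float(prob))
print("most attended patch index:", int(weights.argmax()))
```

In this sketch only the stage-2 parameters (`v`, `w`, `clf_w`, `clf_b`) would be trained, while `proj` stays fixed. That separation is why such a pipeline is not end-to-end: the slide-level loss never updates the extractor.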

Published

2024-03-24

How to Cite

Ma, Y., Luo, X., Fu, K., & Wang, M. (2024). Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(13), 14263-14271. https://doi.org/10.1609/aaai.v38i13.29338

Section

AAAI Technical Track on Machine Learning IV