HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image

Authors

  • Zhuchen Shao Tsinghua Shenzhen International Graduate School, Tsinghua University
  • Yang Chen Tsinghua Shenzhen International Graduate School, Tsinghua University
  • Hao Bian Tsinghua Shenzhen International Graduate School, Tsinghua University
  • Jian Zhang Peking University Shenzhen Graduate School
  • Guojun Liu Harbin Institute of Technology, China
  • Yongbing Zhang Harbin Institute of Technology (Shenzhen)

DOI:

https://doi.org/10.1609/aaai.v37i2.25315

Keywords:

CV: Medical and Biological Imaging, ML: Multi-Instance/Multi-View Learning

Abstract

Survival prediction based on whole slide images (WSIs) is a challenging task for patient-level multiple instance learning (MIL). Due to the vast amount of data for a patient (one or multiple gigapixel WSIs) and the irregular shape of WSIs, it is difficult to fully explore spatial, contextual, and hierarchical interactions in the patient-level bag. Many studies adopt a random-sampling pre-processing strategy and WSI-level aggregation models, which inevitably lose critical prognostic information in the patient-level bag. In this work, we propose a hierarchical vision Transformer framework named HVTSurv, which encodes local-level relative spatial information, strengthens WSI-level context-aware communication, and establishes patient-level hierarchical interaction. First, we design a feature pre-processing strategy that includes feature rearrangement and random window masking. Then, we devise three layers to progressively obtain the patient-level representation: a local-level interaction layer adopting Manhattan distance, a WSI-level interaction layer employing spatial shuffle, and a patient-level interaction layer using attention pooling. Moreover, the hierarchical design makes the model more computationally efficient. Finally, we validate HVTSurv on 3,104 patients and 3,752 WSIs across 6 cancer types from The Cancer Genome Atlas (TCGA). The average C-Index is 2.50-11.30% higher than that of all prior weakly supervised methods over the 6 TCGA datasets. An ablation study and attention visualization further verify the superiority of the proposed HVTSurv. The implementation is available at: https://github.com/szc19990412/HVTSurv.
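The three-stage aggregation described in the abstract can be sketched in highly simplified form. The snippet below is an illustrative approximation only, not the authors' implementation: the Manhattan-distance window radius, feature dimension, and the replacement of full Transformer attention blocks with a single attention-pooling step are all hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def manhattan_window(coords, center, radius):
    """Local level: select patches within a Manhattan-distance window
    of a center patch (radius is a hypothetical choice)."""
    d = np.abs(coords - coords[center]).sum(axis=1)
    return np.where(d <= radius)[0]

def spatial_shuffle(features, rng):
    """WSI level: randomly permute patch features so that windows in a
    subsequent layer mix patches from spatially distant regions."""
    return features[rng.permutation(len(features))]

def attention_pool(features, w):
    """Patient level: softmax-weighted sum of patch features into a
    single patient-level representation."""
    scores = features @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ features

# Toy patient bag: 16 patches, 8-dim features, integer grid coordinates.
feats = rng.normal(size=(16, 8))
coords = rng.integers(0, 10, size=(16, 2))

neighbors = manhattan_window(coords, center=0, radius=3)
shuffled = spatial_shuffle(feats, rng)
patient_repr = attention_pool(shuffled, w=rng.normal(size=8))
```

In the actual framework these operations are interleaved with windowed self-attention layers; the sketch only shows how the three levels of interaction compose from local windows to a single patient-level vector.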

Published

2023-06-26

How to Cite

Shao, Z., Chen, Y., Bian, H., Zhang, J., Liu, G., & Zhang, Y. (2023). HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2209-2217. https://doi.org/10.1609/aaai.v37i2.25315

Section

AAAI Technical Track on Computer Vision II