PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology

Authors

  • Fengchun Liu, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology
  • Songhan Jiang, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology
  • Linghan Cai, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology
  • Ziyue Wang, National University of Singapore, Department of Electrical and Computer Engineering
  • Yongbing Zhang, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i9.37649

Abstract

While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level sub-captions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in research and clinical practice.

Published

2026-03-14

How to Cite

Liu, F., Jiang, S., Cai, L., Wang, Z., & Zhang, Y. (2026). PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7132–7140. https://doi.org/10.1609/aaai.v40i9.37649

Section

AAAI Technical Track on Computer Vision VI