Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation

Authors

  • PeiYuan Tang School of Computer Science and Technology, Xi’an Jiaotong University
  • Xiaodong Zhang School of Computer Science and Technology, Xidian University Shaanxi Key Laboratory of Network and System Security, Xidian University
  • Chunze Yang School of Computer Science and Technology, Xi’an Jiaotong University
  • Haoran Yuan Synkrotron, Inc.
  • Jun Sun Singapore Management University
  • Danfeng Shan School of Computer Science and Technology, Xi’an Jiaotong University
  • Zijiang James Yang Synkrotron, Inc. University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i19.34295

Abstract

Deep learning models often suffer from performance degradation in unseen domains, posing a risk for safety-critical applications such as autonomous driving. To tackle this problem, recent studies have leveraged pre-trained Visual Foundation Models (VFMs) to enhance generalization. However, existing works mainly focus on designing intricate networks on top of VFMs, neglecting their inherent strong generalization potential. Moreover, these methods typically perform inference on low-resolution images. The resulting loss of detail hinders accurate predictions in unseen domains, especially for small objects. In this paper, we argue that simply fine-tuning VFMs and leveraging high-resolution images can unleash the power of VFMs for generalizable semantic segmentation. Therefore, we design a VFM-based segmentation network (VFMNet) that adapts VFMs to this task with minimal fine-tuning, preserving their generalizable knowledge. Then, to fully utilize high-resolution images, we train a Mask-guided Refinement Network (MGRNet) to refine VFMNet's predictions by combining detailed image features. Furthermore, we adopt a two-stage coarse-to-fine inference approach: MGRNet refines the low-confidence regions predicted by VFMNet to obtain fine-grained results. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art methods by 3.3% in average mIoU on synthetic-to-real domain generalization.
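The two-stage coarse-to-fine inference described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `coarse_to_fine`, the confidence threshold `tau`, and the `refine_fn` callable (standing in for MGRNet) are all hypothetical names, and the confidence measure (maximum softmax probability) is an assumption.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine(coarse_logits, refine_fn, tau=0.9):
    """Two-stage inference sketch: keep confident coarse labels, refine the rest.

    coarse_logits: (H, W, C) logits from the coarse network (VFMNet's role).
    refine_fn: callable taking the low-confidence mask and returning an
               (H, W) array of refined labels (MGRNet's role; hypothetical API).
    tau: confidence threshold below which pixels are handed to the refiner.
    """
    probs = softmax(coarse_logits)
    conf = probs.max(axis=-1)        # per-pixel confidence (assumed measure)
    labels = probs.argmax(axis=-1)   # coarse per-pixel prediction
    low_conf = conf < tau            # regions selected for refinement
    refined = refine_fn(low_conf)    # refined labels at the same resolution
    labels[low_conf] = refined[low_conf]
    return labels, low_conf
```

In this sketch, only pixels whose coarse confidence falls below `tau` are overwritten by the refiner, so the expensive high-resolution pass is restricted to uncertain regions.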

Published

2025-04-11

How to Cite

Tang, P., Zhang, X., Yang, C., Yuan, H., Sun, J., Shan, D., & Yang, Z. J. (2025). Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(19), 20823–20831. https://doi.org/10.1609/aaai.v39i19.34295

Section

AAAI Technical Track on Machine Learning V