Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation

Authors

  • PeiYuan Tang School of Computer Science and Technology, Xi’an Jiaotong University
  • Xiaodong Zhang School of Computer Science and Technology, Xidian University Shaanxi Key Laboratory of Network and System Security, Xidian University
  • Chunze Yang School of Computer Science and Technology, Xi’an Jiaotong University
  • Haoran Yuan Synkrotron, Inc.
  • Jun Sun Singapore Management University
  • Danfeng Shan School of Computer Science and Technology, Xi’an Jiaotong University
  • Zijiang James Yang Synkrotron, Inc. University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i19.34295

Abstract

Deep learning models often suffer from performance degradation in unseen domains, posing a risk for safety-critical applications such as autonomous driving. To tackle this problem, recent studies have leveraged pre-trained Visual Foundation Models (VFMs) to enhance generalization. However, existing works mainly focus on designing intricate networks on top of VFMs, neglecting their inherent strong generalization potential. Moreover, these methods typically perform inference on low-resolution images. The resulting loss of detail hinders accurate predictions in unseen domains, especially for small objects. In this paper, we argue that simply fine-tuning VFMs and leveraging high-resolution images can unleash the power of VFMs for generalizable semantic segmentation. Therefore, we design a VFM-based segmentation network (VFMNet) that adapts VFMs to this task with minimal fine-tuning, preserving their generalizable knowledge. Then, to fully utilize high-resolution images, we train a Mask-guided Refinement Network (MGRNet) to refine VFMNet's predictions by combining detailed image features. Furthermore, we adopt a two-stage coarse-to-fine inference approach: MGRNet refines the low-confidence regions predicted by VFMNet to obtain fine-grained results. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art methods by 3.3% in average mIoU on synthetic-to-real domain generalization.
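The two-stage coarse-to-fine inference described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `coarse_to_fine`, the confidence threshold `tau`, and the `refine_fn` callable (standing in for MGRNet) are all hypothetical names, and the confidence measure (maximum softmax probability) is an assumption.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine(coarse_logits, refine_fn, tau=0.9):
    """Two-stage inference sketch: keep confident coarse labels, refine the rest.

    coarse_logits: (H, W, C) logits from the coarse network (VFMNet's role).
    refine_fn: callable taking the low-confidence mask and returning an
               (H, W) array of refined labels (MGRNet's role; hypothetical API).
    tau: confidence threshold below which pixels are handed to the refiner.
    """
    probs = softmax(coarse_logits)
    conf = probs.max(axis=-1)        # per-pixel confidence (assumed measure)
    labels = probs.argmax(axis=-1)   # coarse per-pixel prediction
    low_conf = conf < tau            # regions selected for refinement
    refined = refine_fn(low_conf)    # refined labels at the same resolution
    labels[low_conf] = refined[low_conf]
    return labels, low_conf
```

In this sketch, only pixels whose coarse confidence falls below `tau` are overwritten by the refiner, so the expensive high-resolution pass is restricted to uncertain regions.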

Published

2025-04-11

How to Cite

Tang, P., Zhang, X., Yang, C., Yuan, H., Sun, J., Shan, D., & Yang, Z. J. (2025). Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(19), 20823–20831. https://doi.org/10.1609/aaai.v39i19.34295

Section

AAAI Technical Track on Machine Learning V