Learning Beyond Vision: Vision-Language Distillation and Edge-Aware Mix Diffusion in Semi-Supervised Semantic Segmentation

Rui Yang; Yunfei Bai; Yuehua Liu; Xiaomao Li; Shaorong Xie

doi:10.1609/aaai.v40i14.38152

Authors

Rui Yang, School of Computer Engineering and Science, Shanghai University
Yunfei Bai, School of Mechatronic Engineering and Automation, Shanghai University
Yuehua Liu, School of Computer Engineering and Science, Shanghai University
Xiaomao Li, School of Mechatronic Engineering and Automation, Shanghai University
Shaorong Xie, School of Computer Engineering and Science, Shanghai University

Abstract

In semi-supervised semantic segmentation (SSSS), segmentation performance is heavily constrained by the quality of pseudo labels. However, prevalent pseudo-label optimization approaches rely on the model’s internal self-correction. When the model fails to recognize or adequately represent certain classes, this self-enhancement mechanism amplifies initial mistakes, ultimately leading to poor semantic or spatial consistency. To address this limitation, we propose ViLaDiff to enhance pseudo-label quality. Specifically, ViLaDiff first employs a prompt-guided image captioning task to generate descriptive text for each input image, providing high-level semantic context. To our knowledge, this is the first attempt to introduce vision-language modeling into SSSS. We design a vision-language fusion module to enhance feature semantics and discriminative capability. It integrates cross-modal interactions with dual-path knowledge to ensure semantic consistency. Additionally, while language provides high-level semantic guidance, it is inherently limited in expressing fine-grained spatial structures. Therefore, we propose an edge-aware mixed-noise diffusion process. It simulates feature-level uncertainty through Gaussian perturbations and introduces class-flipping noise into the masks to model misclassification errors. To enhance boundary refinement, we apply a higher flipping probability along mask edges, enabling edge-aware modeling during denoising. Extensive experiments on public benchmarks validate that our method significantly improves pseudo-label quality and segmentation performance.
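
The edge-aware mixed-noise idea can be illustrated with a small sketch of a single forward noising step. The abstract does not specify the noise schedule, the edge detector, or the distribution over flipped classes, so everything below is an assumption made for illustration: the function names (`edge_mask`, `flip_labels`, `perturb_features`), the 4-neighbour edge test, the uniform choice of replacement class, and the parameters `p_interior`, `p_edge`, and `sigma` are hypothetical and not taken from ViLaDiff.

```python
import random


def edge_mask(labels):
    """Mark pixels that have a 4-neighbour with a different class (assumed edge test)."""
    h, w = len(labels), len(labels[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    edges[y][x] = True
                    break
    return edges


def flip_labels(labels, num_classes, p_interior, p_edge, rng):
    """Class-flipping noise: edge pixels flip with p_edge, all others with p_interior."""
    edges = edge_mask(labels)
    noisy = [row[:] for row in labels]
    for y, row in enumerate(labels):
        for x, c in enumerate(row):
            p = p_edge if edges[y][x] else p_interior
            if rng.random() < p:
                # Assumption: the replacement class is drawn uniformly from the others.
                noisy[y][x] = rng.choice([k for k in range(num_classes) if k != c])
    return noisy, edges


def perturb_features(features, sigma, rng):
    """Gaussian perturbation standing in for feature-level uncertainty."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in features]


if __name__ == "__main__":
    rng = random.Random(0)
    mask = [[0, 0, 1, 1] for _ in range(4)]          # two classes, vertical boundary
    noisy_mask, edges = flip_labels(mask, 2, p_interior=0.02, p_edge=0.3, rng=rng)
    noisy_feats = perturb_features([[0.5, 1.0], [1.5, 2.0]], sigma=0.1, rng=rng)
    print(edges[0])
    print(noisy_mask)
    print(noisy_feats)
```

Setting `p_edge` above `p_interior` concentrates label corruption near class boundaries, so a denoiser trained to undo it would see most of its correction signal along edges. That is the intuition the abstract describes; the actual method may realize it differently.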
