Learning Beyond Vision: Vision-Language Distillation and Edge-Aware Mix Diffusion in Semi-Supervised Semantic Segmentation

Authors

  • Rui Yang, School of Computer Engineering and Science, Shanghai University
  • Yunfei Bai, School of Mechatronic Engineering and Automation, Shanghai University
  • Yuehua Liu, School of Computer Engineering and Science, Shanghai University
  • Xiaomao Li, School of Mechatronic Engineering and Automation, Shanghai University
  • Shaorong Xie, School of Computer Engineering and Science, Shanghai University

DOI:

https://doi.org/10.1609/aaai.v40i14.38152

Abstract

In semi-supervised semantic segmentation (SSSS), segmentation performance is heavily constrained by the quality of pseudo labels. However, prevalent pseudo-label optimization approaches rely on the model’s internal self-correction. When the model fails to recognize or adequately represent certain classes, this self-enhancement mechanism amplifies initial mistakes, ultimately leading to poor semantic or spatial consistency. To address this limitation, we propose ViLaDiff to enhance pseudo-label quality. Specifically, ViLaDiff first employs a prompt-guided image captioning task to generate descriptive text for each input image, providing high-level semantic context. To our knowledge, this is the first attempt to introduce vision-language modeling into SSSS. We design a vision-language fusion module to enhance feature semantics and discriminative capability. It integrates cross-modal interactions with dual-path knowledge to ensure semantic consistency. Additionally, while language provides high-level semantic guidance, it is inherently limited in expressing fine-grained spatial structures. Therefore, we propose an edge-aware mixed-noise diffusion process. It simulates feature-level uncertainty through Gaussian perturbations and introduces class-flipping noise into the masks to model misclassification errors. To enhance boundary refinement, we apply a higher flipping probability along mask edges, enabling edge-aware modeling during denoising. Extensive experiments on public benchmarks validate that our method significantly improves pseudo-label quality and segmentation performance.
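
The mixed-noise step described in the abstract (Gaussian perturbation of features, plus class flipping in masks with a higher flip probability on edges) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `edge_map` and `mixed_noise`, the 4-neighbour edge definition, the uniform choice of the replacement class, and the default values of `sigma`, `p_flip`, and `p_edge` are all assumptions made for illustration.

```python
import numpy as np


def edge_map(mask):
    """Return a boolean (H, W) map that is True where a pixel's label
    differs from at least one of its 4-neighbours (assumed edge definition)."""
    edges = np.zeros(mask.shape, dtype=bool)
    diff_v = mask[1:, :] != mask[:-1, :]   # vertical neighbours disagree
    diff_h = mask[:, 1:] != mask[:, :-1]   # horizontal neighbours disagree
    edges[1:, :] |= diff_v
    edges[:-1, :] |= diff_v
    edges[:, 1:] |= diff_h
    edges[:, :-1] |= diff_h
    return edges


def mixed_noise(features, mask, num_classes,
                sigma=0.1, p_flip=0.05, p_edge=0.3, rng=None):
    """Apply one illustrative mixed-noise step.

    features: float array (C, H, W); mask: integer array (H, W).
    sigma, p_flip and p_edge are hypothetical hyperparameters.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Feature-level uncertainty: additive Gaussian perturbation.
    noisy_features = features + sigma * rng.standard_normal(features.shape)

    # Mask-level errors: flip labels, more often on edge pixels.
    edges = edge_map(mask)
    flip_prob = np.where(edges, p_edge, p_flip)
    flip = rng.random(mask.shape) < flip_prob

    # A non-zero offset modulo num_classes always yields a different class.
    offset = rng.integers(1, num_classes, size=mask.shape)
    noisy_mask = np.where(flip, (mask + offset) % num_classes, mask)
    return noisy_features, noisy_mask, edges
```

A denoising model trained to recover the clean mask from `noisy_mask` would see corruption concentrated around class boundaries, which is the intuition behind the edge-aware design; how the paper schedules the noise across diffusion steps is not specified in the abstract and is not modelled here.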

Published

2026-03-14

How to Cite

Yang, R., Bai, Y., Liu, Y., Li, X., & Xie, S. (2026). Learning Beyond Vision: Vision-Language Distillation and Edge-Aware Mix Diffusion in Semi-Supervised Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11676–11684. https://doi.org/10.1609/aaai.v40i14.38152

Section

AAAI Technical Track on Computer Vision XI