SeViL: Semi-supervised Vision-Language Learning with Text Prompt Guiding for Moving Infrared Small Target Detection

Weiwei Duan; Luping Ji; Jianghong Huang; Sicheng Zhu

doi:10.1609/aaai.v40i5.37372

Authors

Weiwei Duan, University of Electronic Science and Technology of China
Luping Ji, University of Electronic Science and Technology of China
Jianghong Huang, University of Electronic Science and Technology of China
Sicheng Zhu, University of Electronic Science and Technology of China

Abstract

Unlike traditional object detection, moving infrared small target detection is highly challenging because targets are tiny and labeled samples are limited. Most existing methods focus on pure-vision features, usually learned in a fully-supervised manner, and therefore rely heavily on extensive, high-cost manual annotations. Moreover, they have barely explored the potential of multi-modal (e.g., vision and text) learning. To address these issues, inspired by prevalent vision-language models, we propose the first semi-supervised vision-language (SeViL) framework with adaptive text prompt guiding. Going beyond the traditional pure-vision modality, it takes text prompts as prior knowledge to adaptively enhance target regions and then filter out the low-quality pseudo-labels generated on unlabeled data. Meanwhile, we employ an adaptive cross-modal masking strategy to align text and vision features, promoting deep cross-modal interaction. Remarkably, our extensive experiments on three public datasets (DAUB, ITSDT-15K and IRDST) verify that our new scheme outperforms other semi-supervised ones and even achieves performance comparable to fully-supervised state-of-the-art (SOTA) methods with only 10% labeled training samples.
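
The idea of using a text prompt as a prior to filter low-quality pseudo-labels can be illustrated with a minimal sketch. This is not the SeViL implementation: the abstract does not specify the filtering rule, so the cosine-similarity test, the two thresholds, and all names (`PseudoLabel`, `filter_pseudo_labels`, `det_thresh`, `sim_thresh`) are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PseudoLabel:
    """A hypothetical pseudo-label predicted on an unlabeled frame."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    det_score: float                        # detector confidence in [0, 1]
    region_feat: Sequence[float]            # visual embedding of the box region


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def filter_pseudo_labels(
    labels: List[PseudoLabel],
    text_feat: Sequence[float],
    det_thresh: float = 0.5,   # assumed value, not from the paper
    sim_thresh: float = 0.3,   # assumed value, not from the paper
) -> List[PseudoLabel]:
    """Keep a pseudo-label only if the detector is confident AND its region
    embedding agrees with the text-prompt embedding (assumed criterion)."""
    kept = []
    for lab in labels:
        sim = cosine(lab.region_feat, text_feat)
        if lab.det_score >= det_thresh and sim >= sim_thresh:
            kept.append(lab)
    return kept


# Toy usage: a 2-D embedding space where the prompt points along the x-axis.
text_feat = [1.0, 0.0]
candidates = [
    PseudoLabel((10, 10, 14, 14), 0.9, [1.0, 0.0]),  # confident, matches prompt
    PseudoLabel((40, 40, 44, 44), 0.9, [0.0, 1.0]),  # confident, orthogonal to prompt
    PseudoLabel((70, 70, 74, 74), 0.2, [1.0, 0.0]),  # matches prompt, low confidence
]
kept = filter_pseudo_labels(candidates, text_feat)
```

In this toy setting only the first candidate survives. In a real semi-supervised pipeline the embeddings would come from the vision and text encoders, and the thresholds would likely be tuned or made adaptive.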
