Dual-Phase Visual-Language Pretraining and Adaptation for Long-Tailed Multi-Label Recognition

Authors

  • Yongcheng Li, Tongji University
  • Xuekuan Wang, Tongji University
  • Zhifei Zhang, Tongji University
  • Cairong Zhao, Tongji University

DOI:

https://doi.org/10.1609/aaai.v40i8.37600

Abstract

Long-Tailed Multi-Label Recognition (LTML) is a critical yet challenging task due to two core issues: the severe scarcity of training samples for rare "tail" classes, and the complex co-occurrence patterns among labels that often lead to biased models. To address these issues, we propose DP-VLPA, a novel Dual-Phase Visual-Language Pretraining and Adaptation framework. In the first phase, our Structured Tail-Aware Generation (STAG) module employs a Large Language Model (LLM) to create detailed descriptions that explicitly emphasize tail classes and their contextual relationships, providing a strong and less biased feature foundation. In the second, adaptation phase, we ensure this knowledge is applied effectively: a Dynamic Query Reweighting (DQR) mechanism forces the model to attend to crucial tail-class evidence, while a Co-occurrence-Aware (COA) loss explicitly teaches the model the statistical dependencies between labels, correcting for co-occurrence biases. Extensive experiments on the VOC-LT and COCO-LT datasets demonstrate state-of-the-art performance, with mAP scores of 90.72% and 74.42%, respectively, surpassing the previous best methods by 2.84% and 8.23%.
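
The abstract describes the Co-occurrence-Aware (COA) loss and tail-class reweighting only at a high level, not their exact formulations. As a rough illustration of the general idea, the following is a minimal, hypothetical Python sketch combining class-reweighted binary cross-entropy (heavier positive weights on tail classes) with a penalty that nudges predictions toward empirical label co-occurrence statistics; all names (cooccurrence_aware_loss, cooc, tail_weights, lam) are invented for illustration and are not taken from the paper.

    import torch
    import torch.nn.functional as F

    def cooccurrence_aware_loss(logits, targets, cooc, tail_weights, lam=0.1):
        """Hypothetical sketch: tail-reweighted BCE plus a co-occurrence penalty.

        logits:       (B, C) raw class scores from the recognition head
        targets:      (B, C) multi-hot ground-truth labels
        cooc:         (C, C) empirical co-occurrence matrix, cooc[i, j] ~ P(j | i)
        tail_weights: (C,) positive-class weights, larger for rare (tail) classes
        lam:          trade-off between the classification and co-occurrence terms
        """
        # Tail-aware classification term: rare classes receive larger positive weights.
        bce = F.binary_cross_entropy_with_logits(
            logits, targets, pos_weight=tail_weights, reduction="mean"
        )

        # Co-occurrence term: for labels present in the image, the predicted
        # probabilities of their frequently co-occurring labels should not fall
        # far below the empirical conditional frequencies.
        probs = torch.sigmoid(logits)
        expected = targets @ cooc                                   # (B, C)
        expected = expected / targets.sum(dim=1, keepdim=True).clamp(min=1.0)
        cooc_penalty = F.relu(expected - probs).mean()

        return bce + lam * cooc_penalty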

Published

2026-03-14

How to Cite

Li, Y., Wang, X., Zhang, Z., & Zhao, C. (2026). Dual-Phase Visual-Language Pretraining and Adaptation for Long-Tailed Multi-Label Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6690–6698. https://doi.org/10.1609/aaai.v40i8.37600

Section

AAAI Technical Track on Computer Vision V