Dual-Phase Visual-Language Pretraining and Adaptation for Long-Tailed Multi-Label Recognition

Authors

  • Yongcheng Li, Tongji University
  • Xuekuan Wang, Tongji University
  • Zhifei Zhang, Tongji University
  • Cairong Zhao, Tongji University

DOI:

https://doi.org/10.1609/aaai.v40i8.37600

Abstract

Long-Tailed Multi-Label Recognition (LTML) is a critical yet challenging task due to two core issues: the severe scarcity of training samples for rare "tail" classes, and the complex co-occurrence patterns among labels that often lead to biased models. To address these issues, we propose DP-VLPA, a novel Dual-Phase Visual-Language Pretraining and Adaptation framework. In the first phase, our Structured Tail-Aware Generation (STAG) module employs a Large Language Model (LLM) to create detailed descriptions that explicitly emphasize tail classes and their contextual relationships, providing a strong and less biased feature foundation. In the second, adaptation phase, we ensure this knowledge is applied effectively: a Dynamic Query Reweighting (DQR) mechanism forces the model to attend to crucial tail-class evidence, while a Co-occurrence-Aware (COA) loss explicitly teaches the model the statistical dependencies between labels, correcting for co-occurrence biases. Extensive experiments on the VOC-LT and COCO-LT datasets demonstrate state-of-the-art performance, with mAP scores of 90.72% and 74.42%, respectively, surpassing the previous best methods by 2.84% and 8.23%.
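
The abstract describes the Co-occurrence-Aware (COA) loss and tail-class reweighting only at a high level, not their exact formulations. As a rough illustration of the general idea, the following is a minimal, hypothetical Python sketch combining class-reweighted binary cross-entropy (heavier positive weights on tail classes) with a penalty that nudges predictions toward empirical label co-occurrence statistics; all names (cooccurrence_aware_loss, cooc, tail_weights, lam) are invented for illustration and are not taken from the paper.

    import torch
    import torch.nn.functional as F

    def cooccurrence_aware_loss(logits, targets, cooc, tail_weights, lam=0.1):
        """Hypothetical sketch: tail-reweighted BCE plus a co-occurrence penalty.

        logits:       (B, C) raw class scores from the recognition head
        targets:      (B, C) multi-hot ground-truth labels
        cooc:         (C, C) empirical co-occurrence matrix, cooc[i, j] ~ P(j | i)
        tail_weights: (C,) positive-class weights, larger for rare (tail) classes
        lam:          trade-off between the classification and co-occurrence terms
        """
        # Tail-aware classification term: rare classes receive larger positive weights.
        bce = F.binary_cross_entropy_with_logits(
            logits, targets, pos_weight=tail_weights, reduction="mean"
        )

        # Co-occurrence term: for labels present in the image, the predicted
        # probabilities of their frequently co-occurring labels should not fall
        # far below the empirical conditional frequencies.
        probs = torch.sigmoid(logits)
        expected = targets @ cooc                                   # (B, C)
        expected = expected / targets.sum(dim=1, keepdim=True).clamp(min=1.0)
        cooc_penalty = F.relu(expected - probs).mean()

        return bce + lam * cooc_penalty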

Published

2026-03-14

How to Cite

Li, Y., Wang, X., Zhang, Z., & Zhao, C. (2026). Dual-Phase Visual-Language Pretraining and Adaptation for Long-Tailed Multi-Label Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6690–6698. https://doi.org/10.1609/aaai.v40i8.37600

Section

AAAI Technical Track on Computer Vision V