DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning


  • Zhuo Chen: College of Computer Science and Technology, Zhejiang University; Donghai Laboratory, Zhoushan 316021, China; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Yufeng Huang: School of Software Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Jiaoyan Chen: Department of Computer Science, The University of Manchester
  • Yuxia Geng: College of Computer Science and Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Wen Zhang: School of Software Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Yin Fang: College of Computer Science and Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Jeff Z. Pan: School of Informatics, The University of Edinburgh
  • Huajun Chen: College of Computer Science and Technology, Zhejiang University; Donghai Laboratory, Zhoushan 316021, China; Alibaba-Zhejiang University Joint Institute of Frontier Technologies




CV: Multi-modal Vision, CV: Language and Vision, CV: Representation Learning for Vision, DMKM: Mining of Visual, Multimedia & Multimodal Data


Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. Attributes, which are annotations of class-level visual characteristics, are among the most effective and widely used forms of semantic information for zero-shot image classification. However, current methods often fail to discriminate subtle visual distinctions between images, owing not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's ability to disentangle semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics in the presence of attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and on a knowledge-graph-equipped ZSL benchmark. Its components are effective and its predictions are interpretable.
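The abstract does not specify the form of the attribute-level contrastive objective. As a rough illustration only, a generic InfoNCE-style contrastive loss can be sketched as follows; the function names `cosine` and `info_nce`, the temperature value, and the toy vectors are all assumptions made for this example and are not taken from DUET.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not the paper's code).

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings: the anchor and the positive are assumed to share an attribute.
anchor = [1.0, 0.0]
same_attribute = [0.9, 0.1]
other_attributes = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = info_nce(anchor, same_attribute, other_attributes)
misaligned_loss = info_nce(anchor, [0.0, 1.0], [same_attribute, [-1.0, 0.0]])
```

In this sketch, `aligned_loss` comes out lower than `misaligned_loss`, which is the behaviour a contrastive objective rewards: embeddings that share an attribute are pulled together, and the others are pushed apart.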




How to Cite

Chen, Z., Huang, Y., Chen, J., Geng, Y., Zhang, W., Fang, Y., Pan, J. Z., & Chen, H. (2023). DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 405-413. https://doi.org/10.1609/aaai.v37i1.25114



AAAI Technical Track on Computer Vision I