DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning

Authors

  • Zhuo Chen, College of Computer Science and Technology, Zhejiang University; Donghai Laboratory, Zhoushan 316021, China; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Yufeng Huang, School of Software Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Jiaoyan Chen, Department of Computer Science, The University of Manchester
  • Yuxia Geng, College of Computer Science and Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Wen Zhang, School of Software Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Yin Fang, College of Computer Science and Technology, Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  • Jeff Z. Pan, School of Informatics, The University of Edinburgh
  • Huajun Chen, College of Computer Science and Technology, Zhejiang University; Donghai Laboratory, Zhoushan 316021, China; Alibaba-Zhejiang University Joint Institute of Frontier Technologies

DOI:

https://doi.org/10.1609/aaai.v37i1.25114

Keywords:

CV: Multi-modal Vision, CV: Language and Vision, CV: Representation Learning for Vision, DMKM: Mining of Visual, Multimedia & Multimodal Data

Abstract

Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training. One of the most effective and widely used forms of semantic information for zero-shot image classification is attributes, which are annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, due not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-modal objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark, that its components are effective, and that its predictions are interpretable.
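To make the attribute-level contrastive idea in (2) concrete, the sketch below shows one plausible InfoNCE-style formulation; it is not the authors' released implementation. It assumes image features come from a vision transformer and attribute-text embeddings come from a PLM encoder, and all names in it (attribute_contrastive_loss, temperature, the positive/negative sampling) are hypothetical illustrations only.

# Hypothetical sketch of an attribute-level contrastive loss (InfoNCE-style),
# not the paper's code. Each image is pulled toward an attribute it exhibits
# and pushed away from sampled co-occurring/confusable attributes.
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(image_emb, pos_attr_emb, neg_attr_emb, temperature=0.07):
    # image_emb:    (B, D)    image (or region) features from a visual encoder
    # pos_attr_emb: (B, D)    embedding of an attribute present in each image
    # neg_attr_emb: (B, K, D) embeddings of K negative (e.g. co-occurring) attributes
    image_emb = F.normalize(image_emb, dim=-1)
    pos_attr_emb = F.normalize(pos_attr_emb, dim=-1)
    neg_attr_emb = F.normalize(neg_attr_emb, dim=-1)

    # Cosine similarity to the positive attribute and to each negative attribute.
    pos_logit = (image_emb * pos_attr_emb).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum('bd,bkd->bk', image_emb, neg_attr_emb)   # (B, K)

    # Standard InfoNCE: the positive sits at index 0 of each row of logits.
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature   # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

Sampling negatives from attributes that frequently co-occur with the positive is one way such a loss could counter the imbalance and co-occurrence problem the abstract describes.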

Published

2023-06-26

How to Cite

Chen, Z., Huang, Y., Chen, J., Geng, Y., Zhang, W., Fang, Y., Pan, J. Z., & Chen, H. (2023). DUET: Cross-Modal Semantic Grounding for Contrastive Zero-Shot Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 405-413. https://doi.org/10.1609/aaai.v37i1.25114

Issue

Vol. 37 No. 1 (2023)

Section

AAAI Technical Track on Computer Vision I