Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Authors

  • Ziming Liu, Department of Computing, The Hong Kong Polytechnic University
  • Jingcai Guo, Department of Computing, The Hong Kong Polytechnic University
  • Song Guo, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
  • Xiaocheng Lu, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

DOI

https://doi.org/10.1609/aaai.v39i18.34098

Abstract

This paper investigates the challenging problem of multi-label zero-shot learning (MLZSL), wherein a model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually analyze the relationships among the seen classes residing in a sample along the spatial or semantic dimension and transfer the learned model to unseen classes. However, they neglect the integrity of local and global features. Although attention structures can accurately locate local features, especially objects, they sacrifice the integrity of those features and distort the relationships between classes; likewise, coarse processing of global features undermines comprehensiveness. As a result, the model loses its grasp of the main components of an image, and relying only on the local evidence of seen classes during inference introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, that fully exploits these properties and enables a more accurate and robust visual-semantic projection. For spatial information, we achieve effective refinement by group-aggregating image features into several semantic prompts; this aggregates semantic rather than class information and thus preserves the correlations among semantics. For global semantics, we use global forward propagation to collect as much information as possible and ensure that no semantics are omitted. Experiments on the large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-V4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods by large margins.
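To make the two branches described in the abstract concrete, below is a minimal PyTorch sketch under stated assumptions: spatial features are softly group-aggregated into a small set of learnable semantic prompts, a global branch projects the pooled feature map, and both are scored against class embeddings. All module names, dimensions, and the scoring rule are illustrative guesses based only on the abstract, not the authors' released implementation.

import torch
import torch.nn as nn


class EpsilonSketch(nn.Module):
    # Minimal sketch of the two branches described in the abstract:
    # (1) group-aggregation of spatial features into semantic prompts, and
    # (2) a global branch over the pooled feature map.
    # All names and dimensions here are illustrative assumptions.
    def __init__(self, feat_dim=2048, sem_dim=300, num_prompts=8):
        super().__init__()
        # Learnable queries that softly group spatial locations into prompts.
        self.prompts = nn.Parameter(torch.randn(num_prompts, feat_dim))
        self.proj_local = nn.Linear(feat_dim, sem_dim)   # prompts -> semantic space
        self.proj_global = nn.Linear(feat_dim, sem_dim)  # pooled map -> semantic space

    def forward(self, feat_map, class_embs):
        # feat_map: (B, HW, feat_dim) backbone features; class_embs: (C, sem_dim).
        # Soft assignment of each spatial location to a semantic prompt.
        attn = torch.softmax(self.prompts @ feat_map.transpose(1, 2), dim=-1)  # (B, G, HW)
        grouped = attn @ feat_map                             # (B, G, feat_dim)
        local_sem = self.proj_local(grouped)                  # (B, G, sem_dim)
        global_sem = self.proj_global(feat_map.mean(dim=1))   # (B, sem_dim)
        # Score each class against its best-matching prompt, plus the global branch.
        local_logits = (local_sem @ class_embs.T).amax(dim=1)  # (B, C)
        global_logits = global_sem @ class_embs.T               # (B, C)
        return local_logits + global_logits


# Example: a batch of 2 feature maps (7x7 grid) scored against 81 NUS-WIDE labels.
model = EpsilonSketch()
scores = model(torch.randn(2, 49, 2048), torch.randn(81, 300))  # (2, 81)

Keeping the grouping at the level of prompts rather than classes makes the aggregation class-agnostic, which is what would allow the learned prompts to transfer to unseen labels at inference.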

Published

2025-04-11

How to Cite

Liu, Z., Guo, J., Guo, S., & Lu, X. (2025). Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 19059–19067. https://doi.org/10.1609/aaai.v39i18.34098

Issue

Vol. 39 No. 18 (2025)

Section

AAAI Technical Track on Machine Learning IV