Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval

Xin Huang; Shilong Wang; Tong Jia; Zhihang Gou; Jingjing Li

doi:10.1609/aaai.v39i16.33922

Authors

Xin Huang Nanyang Normal University
Shilong Wang Nanyang Normal University
Tong Jia Peking University
Zhihang Gou Nanyang Normal University
Jingjing Li Nanyang Normal University

DOI:

https://doi.org/10.1609/aaai.v39i16.33922

Abstract

In the era of big data, cross-modal retrieval is increasingly important in research and application. Given the latent complexity and non-intuitive nature of cross-modal relationships, leveraging external knowledge such as large models has become a popular approach to facilitate modality alignment. Existing methods typically address these challenges by fine-tuning model encoders or using a fixed number of prompts. However, these approaches struggle with the significant information asymmetry between image-text pairs and the high distribution diversity of image data. These limitations not only introduce noise during training but also reduce the accuracy and generalization capabilities in cross-modal retrieval tasks. To address the above issues, this paper proposes Adaptive Prompt-Based Semantic Embedding with Inspired Potential of Implicit Knowledge (APSE-IPIK). On one hand, we propose an inspired potential strategy to extract fine-grained and multi-perspective text descriptions from large-scale pre-trained multimodal models, which can be seen as implicit knowledge injection. These descriptions are integrated into the visual-semantic embedding through cross-modal semantic alignment with images, balancing the information asymmetry between modalities and reducing the embedding of inaccurate mapping relationships. On the other hand, we construct an instance-level query-based prompt pool strategy to adaptively extract the most relevant prompts, addressing alignment biases caused by intra-modal (especially image) data diversity and improving alignment accuracy. Extensive experiments are conducted on two widely used datasets, Flickr30k and MSCOCO, which show the effectiveness of the proposed method.

Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information