Heterogeneous Prompt-Guided Entity Inferring and Distilling for Scene-Text Aware Cross-Modal Retrieval

Authors

  • Zhiqian Zhao, Hangzhou Dianzi University
  • Liang Li, Institute of Computing Technology, Chinese Academy of Sciences
  • Jiehua Zhang, School of Software Engineering, Xi'an Jiaotong University
  • Yaoqi Sun, Hangzhou Dianzi University; Lishui Institute of Hangzhou Dianzi University
  • Xichun Sheng, Macao Polytechnic University; Lishui Institute of Hangzhou Dianzi University
  • Haibing Yin, Hangzhou Dianzi University; Lishui Institute of Hangzhou Dianzi University
  • Shaowei Jiang, Hangzhou Dianzi University; Lishui Institute of Hangzhou Dianzi University

DOI:

https://doi.org/10.1609/aaai.v39i10.33144

Abstract

In cross-modal retrieval, comprehensive image understanding is vital, and the scene text in images can provide fine-grained information for understanding visual semantics. Current methods fail to make full use of scene text: they suffer from the semantic ambiguity of independent scene text and overlook the heterogeneous concepts in image-caption pairs. In this paper, we propose a heterogeneous prompt-guided entity inferring and distilling (HOPID) network to explore the natural connection between scene text in images and captions and to learn a property-centric scene text representation. Specifically, we propose to align scene text in images and captions via a heterogeneous prompt, which consists of a visual prompt and a text prompt. For the text prompt, we introduce a discriminative entity inferring module that infers key scene text words from captions, while the visual prompt highlights the corresponding scene text in images. Furthermore, to secure a robust scene text representation, we design a perceptive entity distilling module that distills the beneficial information of scene text at a fine-grained level. Extensive experiments show that the proposed method significantly outperforms existing approaches on two public cross-modal retrieval benchmarks.

Published

2025-04-11

How to Cite

Zhao, Z., Li, L., Zhang, J., Sun, Y., Sheng, X., Yin, H., & Jiang, S. (2025). Heterogeneous Prompt-Guided Entity Inferring and Distilling for Scene-Text Aware Cross-Modal Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10537–10545. https://doi.org/10.1609/aaai.v39i10.33144

Issue

Vol. 39 No. 10 (2025)

Section

AAAI Technical Track on Computer Vision IX