Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion

Authors

  • Hui Zhang Ocean University of China
  • Weikang Gao Ocean University of China
  • Tao Yang Ocean University of China
  • Yuan Cao Ocean University of China

DOI:

https://doi.org/10.1609/aaai.v40i15.38247

Abstract

With the rapid growth of visual content in open-world environments, zero-shot hashing image retrieval (ZSHIR) has emerged to tackle the challenge of recognizing novel classes using attribute-level and semantic information. However, existing methods often fuse multi-source cues (e.g., attributes, labels, and visual features) only shallowly, through external supervision or feature concatenation, and fail to capture the underlying semantic structure in a generative way. In particular, current bridging strategies between modalities suffer from information fragmentation and weak alignment, limiting the model's ability to fully capture complex attribute-visual relations. Moreover, subtle semantic gaps, or "semantic drift," between seen and unseen classes further degrade inter-class separability and the scalability of hashing models. To address these issues, we propose Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion (PZSH), a novel framework that integrates generative modeling and contrastive learning. PZSH leverages a pre-trained Stable Diffusion (SD) model to synthesize multimodal content and uses dual BLIP encoders to strengthen semantic alignment across modalities. We further design a proxy hashing loss that enforces discriminative binary representations. Extensive experiments on benchmark datasets show that PZSH achieves state-of-the-art performance with stronger generalization to unseen classes.
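The abstract mentions a proxy hashing loss that enforces discriminative binary representations. The paper's exact formulation is not given here, but a common recipe for proxy-based hashing combines a softmax cross-entropy over learnable class proxies with a quantization penalty that drives continuous codes toward ±1. The sketch below illustrates that recipe; the function name, the cosine-similarity logits, and the loss weights are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def proxy_hashing_loss(codes, proxies, labels, temperature=0.2, quant_weight=0.1):
    """Hypothetical sketch of a proxy-based hashing loss (not PZSH's exact loss).

    codes:   (N, B) continuous hash codes in (-1, 1), e.g. tanh outputs
    proxies: (C, B) learnable class proxies, one per class
    labels:  (N,)   integer class labels
    """
    # Cosine similarity between each code and every class proxy.
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    logits = c @ p.T / temperature                      # (N, C)
    # Softmax cross-entropy pulls each code toward its own class proxy
    # and pushes it away from the proxies of all other classes.
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Quantization penalty drives the continuous codes toward {-1, +1},
    # so that sign(codes) loses little information at binarization time.
    quant = ((np.abs(codes) - 1.0) ** 2).mean()
    return ce + quant_weight * quant

# Toy usage: 4 samples, 8-bit codes, 3 classes.
rng = np.random.default_rng(0)
codes = np.tanh(rng.normal(size=(4, 8)))
proxies = rng.normal(size=(3, 8))
labels = np.array([0, 1, 2, 0])
loss = proxy_hashing_loss(codes, proxies, labels)
```

In this style of loss, the proxies act as class anchors shared between seen-class training and unseen-class inference, which is one way such methods improve inter-class separability of the binary codes.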

Published

2026-03-14

How to Cite

Zhang, H., Gao, W., Yang, T., & Cao, Y. (2026). Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12529–12537. https://doi.org/10.1609/aaai.v40i15.38247

Section

AAAI Technical Track on Computer Vision XII