Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion

Authors

  • Hui Zhang Ocean University of China
  • Weikang Gao Ocean University of China
  • Tao Yang Ocean University of China
  • Yuan Cao Ocean University of China

DOI:

https://doi.org/10.1609/aaai.v40i15.38247

Abstract

With the rapid growth of visual content in open-world environments, zero-shot hashing image retrieval (ZSHIR) has emerged to tackle the challenge of recognizing novel classes using attribute-level and semantic information. However, existing methods often fuse multi-source cues (e.g., attributes, labels, and visual features) only shallowly, through external supervision or feature concatenation, and fail to capture the underlying semantic structure in a generative way. In particular, current bridging strategies between modalities suffer from information fragmentation and weak alignment, limiting the model's ability to fully capture complex attribute-visual relations. Moreover, subtle semantic gaps, or "semantic drift," between seen and unseen classes further degrade inter-class separability and the scalability of hashing models. To address these issues, we propose Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion (PZSH), a novel framework that integrates generative modeling and contrastive learning. PZSH leverages a pre-trained Stable Diffusion (SD) model to synthesize multimodal content and uses dual BLIP encoders to strengthen semantic alignment across modalities. We further design a proxy hashing loss that enforces discriminative binary representations. Extensive experiments on benchmark datasets show that PZSH achieves state-of-the-art performance with stronger generalization to unseen classes.
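The abstract mentions a proxy hashing loss that enforces discriminative binary representations. The paper's exact formulation is not given here, but a common recipe for proxy-based hashing combines a softmax cross-entropy over learnable class proxies with a quantization penalty that drives continuous codes toward ±1. The sketch below illustrates that recipe; the function name, the cosine-similarity logits, and the loss weights are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def proxy_hashing_loss(codes, proxies, labels, temperature=0.2, quant_weight=0.1):
    """Hypothetical sketch of a proxy-based hashing loss (not PZSH's exact loss).

    codes:   (N, B) continuous hash codes in (-1, 1), e.g. tanh outputs
    proxies: (C, B) learnable class proxies, one per class
    labels:  (N,)   integer class labels
    """
    # Cosine similarity between each code and every class proxy.
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    logits = c @ p.T / temperature                      # (N, C)
    # Softmax cross-entropy pulls each code toward its own class proxy
    # and pushes it away from the proxies of all other classes.
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Quantization penalty drives the continuous codes toward {-1, +1},
    # so that sign(codes) loses little information at binarization time.
    quant = ((np.abs(codes) - 1.0) ** 2).mean()
    return ce + quant_weight * quant

# Toy usage: 4 samples, 8-bit codes, 3 classes.
rng = np.random.default_rng(0)
codes = np.tanh(rng.normal(size=(4, 8)))
proxies = rng.normal(size=(3, 8))
labels = np.array([0, 1, 2, 0])
loss = proxy_hashing_loss(codes, proxies, labels)
```

In this style of loss, the proxies act as class anchors shared between seen-class training and unseen-class inference, which is one way such methods improve inter-class separability of the binary codes.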

Published

2026-03-14

How to Cite

Zhang, H., Gao, W., Yang, T., & Cao, Y. (2026). Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12529–12537. https://doi.org/10.1609/aaai.v40i15.38247

Section

AAAI Technical Track on Computer Vision XII