Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval

Authors

  • Haiwen Li, Beijing University of Posts and Telecommunications
  • Delong Liu, Beijing University of Posts and Telecommunications
  • Zhaohui Hou, SenseTime
  • Zeliang Ma, SenseTime
  • Fei Su, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture; Key Laboratory of Interactive Technology and Experience System
  • Zhicheng Zhao, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture; Key Laboratory of Interactive Technology and Experience System

DOI:

https://doi.org/10.1609/aaai.v40i8.37534

Abstract

As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that represent the input semantics. However, inversion-based methods suffer from two inherent issues: first, a task discrepancy, because inversion training and CIR inference pursue different objectives; second, a modality discrepancy, because the input feature distributions of training and inference do not match. To address these issues, we propose a lightweight post-hoc framework consisting of two components: (1) a new text-anchored triplet construction pipeline that leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet; and (2) the MoTa-Adapter, a novel parameter-efficient fine-tuning method that adapts the dual encoder to the CIR task using our constructed triplet data. Specifically, on the text side, multiple sets of learnable task prompts are integrated via a Mixture-of-Experts (MoE) layer to capture task-specific priors and handle different types of modifications. On the image side, MoTa-Adapter modulates the inversion network's input to better match the downstream text encoder. In addition, an entropy-based optimization strategy is proposed to assign greater weight to challenging samples, thus improving adaptation efficiency. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements, reaching state-of-the-art performance across four widely-used benchmarks.
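The abstract does not spell out how the entropy-based strategy weights samples; one plausible reading is that a sample whose retrieval distribution over candidate targets has high entropy (i.e., many confusable candidates) is treated as "challenging" and given a larger loss weight. The sketch below illustrates this idea only; the function name, temperature parameter, and normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy_weights(logits, temperature=1.0):
    """Illustrative per-sample weighting from the entropy of the
    softmax over candidate targets: a flatter (higher-entropy)
    distribution means a harder sample and a larger weight.
    Weights are normalized to sum to 1 across the batch."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)        # per-sample entropy
    return h / h.sum()

# An "easy" sample (peaked logits) vs a "hard" sample (flat logits)
# over three candidate targets:
logits = np.array([[8.0, 0.0, 0.0],
                   [1.0, 0.9, 1.1]])
w = entropy_weights(logits)
# the hard (second) sample receives the larger weight
```

Under this reading, the weighted adaptation loss would simply be the dot product of these weights with the per-sample retrieval losses, so gradient updates are dominated by the confusable triplets.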

Published

2026-03-14

How to Cite

Li, H., Liu, D., Hou, Z., Ma, Z., Su, F., & Zhao, Z. (2026). Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6100-6108. https://doi.org/10.1609/aaai.v40i8.37534

Section

AAAI Technical Track on Computer Vision V