Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval

Authors

  • Haiwen Li, Beijing University of Posts and Telecommunications
  • Delong Liu, Beijing University of Posts and Telecommunications
  • Zhaohui Hou, SenseTime
  • Zeliang Ma, SenseTime
  • Fei Su, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture; Key Laboratory of Interactive Technology and Experience System
  • Zhicheng Zhao, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture; Key Laboratory of Interactive Technology and Experience System

DOI:

https://doi.org/10.1609/aaai.v40i8.37534

Abstract

As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that represent the input semantics. However, inversion-based methods suffer from two inherent issues: first, a task discrepancy, because inversion training and CIR inference pursue different objectives; second, a modality discrepancy, because the input feature distributions of training and inference do not match. To address these issues, we propose a lightweight post-hoc framework consisting of two components: (1) a new text-anchored triplet construction pipeline that leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet; and (2) the MoTa-Adapter, a novel parameter-efficient fine-tuning method that adapts the dual encoder to the CIR task using our constructed triplet data. Specifically, on the text side, multiple sets of learnable task prompts are integrated via a Mixture-of-Experts (MoE) layer to capture task-specific priors and handle different types of modifications. On the image side, MoTa-Adapter modulates the inversion network's input to better match the downstream text encoder. In addition, an entropy-based optimization strategy is proposed to assign greater weight to challenging samples, thus improving adaptation efficiency. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements, reaching state-of-the-art performance across four widely-used benchmarks.
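The abstract does not spell out how the entropy-based strategy weights samples; one plausible reading is that a sample whose retrieval distribution over candidate targets has high entropy (i.e., many confusable candidates) is treated as "challenging" and given a larger loss weight. The sketch below illustrates this idea only; the function name, temperature parameter, and normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy_weights(logits, temperature=1.0):
    """Illustrative per-sample weighting from the entropy of the
    softmax over candidate targets: a flatter (higher-entropy)
    distribution means a harder sample and a larger weight.
    Weights are normalized to sum to 1 across the batch."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)        # per-sample entropy
    return h / h.sum()

# An "easy" sample (peaked logits) vs a "hard" sample (flat logits)
# over three candidate targets:
logits = np.array([[8.0, 0.0, 0.0],
                   [1.0, 0.9, 1.1]])
w = entropy_weights(logits)
# the hard (second) sample receives the larger weight
```

Under this reading, the weighted adaptation loss would simply be the dot product of these weights with the per-sample retrieval losses, so gradient updates are dominated by the confusable triplets.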

Published

2026-03-14

How to Cite

Li, H., Liu, D., Hou, Z., Ma, Z., Su, F., & Zhao, Z. (2026). Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6100-6108. https://doi.org/10.1609/aaai.v40i8.37534

Section

AAAI Technical Track on Computer Vision V