Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval
DOI:
https://doi.org/10.1609/aaai.v40i11.37907
Abstract
Zero-shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with varied visual manipulation intents across domains, scenes, objects, and attributes. A key challenge is that existing datasets contain limited intent-relevant annotations, making it hard for models to infer human intent from textual modifications. We introduce an intent-centric image–text dataset generated via reasoning by a Multimodal Large Language Model (MLLM) to better train ZS-CIR models for human manipulation intent understanding. Building on this dataset, we propose De-MINDS, a framework that distills the MLLM’s reasoning ability to capture manipulation intent and enhance models’ comprehension of modified text. A simple mapping network translates image information into language space and combines it with the manipulation text to form a query. De-MINDS then extracts intention-relevant information from this query and encodes it as pseudo-word tokens for accurate ZS-CIR. Across four ZS-CIR tasks, De-MINDS shows strong generalization and improves over existing methods by 2.15% to 4.05%, establishing new state-of-the-art results with comparable inference time.
Published
2026-03-14
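The abstract describes a mapping network that projects image information into the language space and combines it with the manipulation text to form a query. The paper's actual architecture and dimensions are not given in the abstract; the following is a minimal sketch of that general idea, assuming a CLIP-style setup where a two-layer MLP maps an image embedding to a single "pseudo-word" token that is prepended to the text token embeddings. All names and sizes here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): image embedding size
# and text token embedding size in a CLIP-style model.
D_IMG, D_TOK = 512, 768

# Illustrative two-layer MLP standing in for the mapping network.
W1 = rng.standard_normal((D_IMG, 1024)) * 0.02
W2 = rng.standard_normal((1024, D_TOK)) * 0.02

def map_to_pseudo_token(img_emb: np.ndarray) -> np.ndarray:
    """Project an image embedding to a single pseudo-word token."""
    h = np.maximum(img_emb @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

def compose_query(pseudo_token: np.ndarray,
                  text_token_embs: np.ndarray) -> np.ndarray:
    """Prepend the pseudo-word token to the manipulation-text tokens,
    forming a composed query sequence for a text encoder."""
    return np.vstack([pseudo_token[None, :], text_token_embs])

img_emb = rng.standard_normal(D_IMG)
text_tokens = rng.standard_normal((5, D_TOK))  # e.g. tokens of "make it red"

token = map_to_pseudo_token(img_emb)
query = compose_query(token, text_tokens)
print(query.shape)  # (6, 768): pseudo-token followed by 5 text tokens
```

In Pic2Word-style ZS-CIR systems this composed sequence is encoded by the frozen text encoder and matched against target image embeddings; De-MINDS additionally extracts intention-relevant information from the query before encoding, a step this sketch does not attempt to reproduce.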
How to Cite
Tang, Y., Yu, J., Gai, K., Xiong, G., Gou, G., Qiu, M., & Wu, Q. (2026). Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9466-9474. https://doi.org/10.1609/aaai.v40i11.37907
Section
AAAI Technical Track on Computer Vision VIII