Suit the Remedy to the Retriever: Interpretable Query Optimization with Retriever Preference Alignment for Vision-Language Retrieval

GuangHao Meng; Jinpeng Wang; Jieming Zhu; Letian Zhang; Yong Jiang; Dan Zhao; Qing Li

doi:10.1609/aaai.v40i10.37741

Authors

GuangHao Meng, Tsinghua University; Pengcheng Laboratory
Jinpeng Wang, Harbin Institute of Technology, Shenzhen
Jieming Zhu, Huawei Noah's Ark Lab
Letian Zhang, Tsinghua University
Yong Jiang, Tsinghua University; Pengcheng Laboratory
Dan Zhao, Pengcheng Laboratory
Qing Li, Pengcheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i10.37741

Abstract

Vision-language retrieval (VLR), which uses text or image queries to retrieve corresponding cross-modal content, plays a crucial role in multimedia and computer vision tasks. However, challenging concepts in queries often confuse retrievers, limiting their ability to align concepts with visual content. Existing query optimization methods neglect retrievers’ preferences (i.e., text descriptions that better match their corresponding visual content), so the optimized queries are not adapted to the retriever and yield suboptimal performance. To address this, we propose Retriever-Adaptive Query Optimization (RAQO), an interpretable framework that rewrites queries based on retriever-specific preferences. Specifically, we first leverage multimodal large language models (MLLMs) and retriever feedback to construct the MLLMs-Driven Preference-Aware Dataset Engine (MPADE), which automatically refines queries offline, capturing the retriever’s implicit preferences. Then, we introduce a "detect-then-rewrite" chain-of-thought rewriting (ReCoT) strategy equipped with a progressive preference alignment pipeline comprising three stages: ambiguity detection fine-tuning, query rewriting fine-tuning, and preference rank optimization. This design enables the rewriter to focus on confusing concepts and produce retriever-adapted, high-quality queries. Extensive experiments on VLR benchmarks demonstrate the superiority of RAQO in cross-modal retrieval, as well as its interpretability, generalizability, and transferability.
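
The abstract does not spell out how retriever feedback is turned into preference data, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that each candidate rewrite is scored by the rank the retriever assigns to the ground-truth item, and that the best- and worst-ranked rewrites form a (chosen, rejected) pair for later preference optimization. The names `target_rank`, `build_preference_pair`, `toy_score`, and the word-overlap "retriever" over `GALLERY` are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def target_rank(scores: Sequence[float], target_index: int) -> int:
    """Pessimistic 1-based rank of the target: ties count against it."""
    target_score = scores[target_index]
    return sum(1 for s in scores if s >= target_score)


def build_preference_pair(
    candidates: List[str],
    score_fn: Callable[[str], Sequence[float]],
    target_index: int,
) -> Dict[str, object]:
    """Pick the rewrites that rank the target best and worst (assumed scheme)."""
    ranked: List[Tuple[int, str]] = sorted(
        (target_rank(score_fn(query), target_index), query) for query in candidates
    )
    best_rank, chosen = ranked[0]
    worst_rank, rejected = ranked[-1]
    return {
        "chosen": chosen,
        "rejected": rejected,
        "chosen_rank": best_rank,
        "rejected_rank": worst_rank,
    }


# Toy stand-in for a vision-language retriever: each gallery item is a set of
# tags, and the similarity score is the number of words shared with the query.
GALLERY = [
    {"dog", "grass", "ball"},
    {"small", "brown", "dog", "sofa"},  # ground-truth item (index 1)
    {"cat", "sofa"},
]


def toy_score(query: str) -> List[float]:
    words = set(query.lower().split())
    return [float(len(words & tags)) for tags in GALLERY]


pair = build_preference_pair(
    candidates=[
        "a corgi resting indoors",
        "a small brown dog on a sofa",
        "a dog",
    ],
    score_fn=toy_score,
    target_index=1,
)
print(pair["chosen"], "|", pair["rejected"])
```

In this toy setting the rewrite that names concepts the retriever can match is preferred over the one containing a concept ("corgi") the retriever does not recognize, which mirrors the abstract's point that confusing concepts should be rewritten in retriever-friendly terms.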
