Suit the Remedy to the Retriever: Interpretable Query Optimization with Retriever Preference Alignment for Vision-Language Retrieval

Authors

  • GuangHao Meng, Tsinghua University; Pengcheng Laboratory
  • Jinpeng Wang, Harbin Institute of Technology, Shenzhen
  • Jieming Zhu, Huawei Noah's Ark Lab
  • Letian Zhang, Tsinghua University
  • Yong Jiang, Tsinghua University; Pengcheng Laboratory
  • Dan Zhao, Pengcheng Laboratory
  • Qing Li, Pengcheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i10.37741

Abstract

Vision-language retrieval (VLR), which uses text or image queries to retrieve corresponding cross-modal content, plays a crucial role in multimedia and computer vision tasks. However, challenging concepts in queries often confuse retrievers, limiting their ability to align concepts with visual content. Existing query optimization methods neglect retrievers' preferences (i.e., text descriptions that better match their corresponding visual content), so the optimized queries are not adapted to the retriever, which leads to suboptimal performance. To address this, we propose Retriever-Adaptive Query Optimization (RAQO), an interpretable framework that rewrites queries based on retriever-specific preferences. Specifically, we first leverage multimodal large language models (MLLMs) and retriever feedback to construct the MLLMs-Driven Preference-Aware Dataset Engine (MPADE), which automatically refines queries offline, capturing the retriever's implicit preferences. Then, we introduce a "detect-then-rewrite" chain-of-thought rewriting (ReCoT) strategy equipped with a progressive preference alignment pipeline comprising three stages: ambiguity detection fine-tuning, query rewriting fine-tuning, and preference rank optimization. This design enables the rewriter to focus on confusing concepts and produce retriever-adapted, high-quality queries. Extensive experiments on VLR benchmarks demonstrate the superiority of RAQO in cross-modal retrieval, as well as its interpretability, generalizability, and transferability.
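
The abstract does not say how retriever feedback is turned into preference data. As a rough illustration only, and not the authors' implementation, the sketch below assumes that each candidate rewrite is scored by the rank the retriever assigns to the ground-truth image, and that the best- and worst-ranked rewrites form a (chosen, rejected) pair for later preference optimization. The names `rank_of_target`, `build_preference_pair`, and `score_fn` are hypothetical, and the similarity scores are toy values standing in for a real retriever.

```python
from typing import Callable, Optional, Sequence, Tuple


def rank_of_target(scores: Sequence[float], target_idx: int) -> int:
    """Return the 1-based rank of the target item (higher score is better)."""
    target = scores[target_idx]
    return 1 + sum(1 for s in scores if s > target)


def build_preference_pair(
    candidates: Sequence[str],
    score_fn: Callable[[str], Sequence[float]],  # hypothetical retriever scorer
    target_idx: int,
) -> Optional[Tuple[str, str]]:
    """Pick the (chosen, rejected) rewrites by the rank of the ground-truth item.

    Assumption: the retriever's "preference" is approximated by how highly it
    ranks the ground-truth image for each candidate query.
    """
    ranks = {q: rank_of_target(score_fn(q), target_idx) for q in candidates}
    ordered = sorted(candidates, key=lambda q: ranks[q])
    chosen, rejected = ordered[0], ordered[-1]
    if ranks[chosen] == ranks[rejected]:
        return None  # no preference signal when every candidate ties
    return chosen, rejected


# Toy similarity scores over three gallery images; index 0 is the ground truth.
toy_scores = {
    "a dog": [0.4, 0.5, 0.2],
    "a brown corgi on grass": [0.9, 0.3, 0.1],
    "an animal": [0.1, 0.6, 0.7],
}

pair = build_preference_pair(list(toy_scores), toy_scores.__getitem__, target_idx=0)
print(pair)
```

In a real setting `score_fn` would wrap the target retriever (for example, query-to-image similarities from its encoders), which is what would make the resulting pairs specific to that retriever.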

Published

2026-03-14

How to Cite

Meng, G., Wang, J., Zhu, J., Zhang, L., Jiang, Y., Zhao, D., & Li, Q. (2026). Suit the Remedy to the Retriever: Interpretable Query Optimization with Retriever Preference Alignment for Vision-Language Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 7963–7971. https://doi.org/10.1609/aaai.v40i10.37741

Issue

Section

AAAI Technical Track on Computer Vision VII