Imagine with Layout and Sketch: Enhancing Vision-Language Retrieval with Dual-Stream Multi-Modal Query Refinement

GuangHao Meng; Jinpeng Wang; Qian-Wei Wang; XuDong Ren; Dan Zhao

doi:10.1609/aaai.v40i10.37742

Authors

GuangHao Meng, Tsinghua University; Pengcheng Laboratory
Jinpeng Wang, Harbin Institute of Technology, Shenzhen
Qian-Wei Wang, Tsinghua University
XuDong Ren, Tsinghua University
Dan Zhao, Pengcheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i10.37742

Abstract

Vision-Language Retrieval (VLR) aims to retrieve relevant visual or textual information from multimodal data using language or image queries. However, traditional VLR methods often rely on data-driven, shallow semantic alignment and fail to capture the deeper structural and fine-grained entity features of queries, resulting in poor performance on multi-entity layouts and challenging entities. In this paper, we propose the Layout-Aware and Sketch-Enhanced (LASE) VLR framework, which refines query representations by incorporating multimodal layout and sketch knowledge. Specifically, layout knowledge encodes the spatial arrangement of entities, while sketch knowledge refines entity perception by capturing essential structural details. To extract these knowledge representations, we leverage the powerful semantic understanding of Large Language Models (LLMs) for layout generation and the fine-grained cross-modal generative capabilities of Diffusion Models (DMs) for sketch generation. However, integrating knowledge into queries may introduce biases, and queries differ in how much knowledge they need because their visual content and knowledge demands vary. To address this, we propose the Gated Dual-Stream Knowledge Module (GDKM), which consists of a multi-instance fusion network and a sample-aware gating network. The fusion network aggregates diverse knowledge using multi-head attention to reduce bias, while the gating network adjusts knowledge weights based on query characteristics. Extensive experiments demonstrate that LASE significantly enhances VLR performance across multiple benchmarks, with superior generalization and transferability.
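The gated dual-stream idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the function names (`attention_pool`, `gate`, `refine_query`), the residual combination, the scalar sigmoid gate, and the absence of learned projection matrices are all assumptions made for illustration. The sketch only shows the general pattern of pooling several knowledge instances per stream with multi-head attention and then weighting each stream by a query-dependent gate.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(query, instances, num_heads=2):
    """Pool several knowledge instances into one vector.

    Hypothetical simplification: multi-head scaled dot-product attention
    where each head attends over its own slice of the embedding, with no
    learned projections.
    """
    dim = len(query)
    if dim % num_heads:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        keys = [inst[lo:hi] for inst in instances]
        weights = softmax([dot(q, k) / math.sqrt(head_dim) for k in keys])
        for d in range(head_dim):
            pooled.append(sum(w * k[d] for w, k in zip(weights, keys)))
    return pooled

def gate(query, knowledge, gate_weights, gate_bias=0.0):
    """Assumed sample-aware gate: sigmoid of a linear map of [query; knowledge]."""
    z = dot(gate_weights, query + knowledge) + gate_bias
    return 1.0 / (1.0 + math.exp(-z))

def refine_query(query, layout_instances, sketch_instances, w_layout, w_sketch):
    """Add gated layout and sketch knowledge to the query (assumed residual form)."""
    layout = attention_pool(query, layout_instances)
    sketch = attention_pool(query, sketch_instances)
    g_layout = gate(query, layout, w_layout)
    g_sketch = gate(query, sketch, w_sketch)
    return [q + g_layout * l + g_sketch * s
            for q, l, s in zip(query, layout, sketch)]

# Toy 4-dimensional embeddings; all values are made up for illustration.
query = [1.0, 0.0, 0.5, 0.5]
layouts = [[0.9, 0.1, 0.4, 0.6], [0.0, 1.0, 0.0, 1.0]]
sketches = [[1.0, 0.0, 0.5, 0.5]]
zero_gate = [0.0] * 8  # untrained gate weights, so each gate outputs 0.5
refined = refine_query(query, layouts, sketches, zero_gate, zero_gate)
```

In a trained system the gate weights would be learned, so a query that gains little from layout or sketch knowledge could drive the corresponding gate toward zero and leave its representation largely unchanged.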
