Imagine with Layout and Sketch: Enhancing Vision-Language Retrieval with Dual-Stream Multi-Modal Query Refinement
DOI:
https://doi.org/10.1609/aaai.v40i10.37742Abstract
Vision-Language Retrieval (VLR) aims to retrieve relevant visual or textual information from multimodal data using language or image queries. However, traditional VLR methods often rely on data-driven shallow semantic alignment and fail to understand the deeper structural and fine-grained entity features of queries, resulting in poor performance on multi-entity layouts and challenging entities. In this paper, we propose the Layout-Aware and Sketch-Enhanced (LASE) VLR framework, which refines query representations by incorporating multimodal layout and sketch knowledge. Specifically, layout knowledge encodes the spatial arrangement of entities, while sketch knowledge refines entity perception by capturing essential structural details. To extract these knowledge representations, we leverage Large Language Models' (LLMs) powerful semantic understanding for layout generation, and Diffusion Models' (DMs) fine-grained cross-modal generative capabilities for sketch generation. However, integrating knowledge into queries may introduce biases and query-specific preferences due to varying visual content and knowledge demands. To address this, we propose the Gated Dual-Stream Knowledge Module (GDKM), which consists of a multi-instance fusion network with a sample-aware gating network. The fusion network aggregates diverse knowledge using multi-head attention to reduce bias, while the gating network adjusts knowledge weights based on query characteristics. Extensive experiments demonstrate that the LASE significantly enhances VLR performance across multiple benchmarks, with superior generalization and transferability.Downloads
Published
2026-03-14
How to Cite
Meng, G., Wang, J., Wang, Q.-W., Ren, X., & Zhao, D. (2026). Imagine with Layout and Sketch: Enhancing Vision-Language Retrieval with Dual-Stream Multi-Modal Query Refinement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 7972-7980. https://doi.org/10.1609/aaai.v40i10.37742
Issue
Section
AAAI Technical Track on Computer Vision VII