Imagine with Layout and Sketch: Enhancing Vision-Language Retrieval with Dual-Stream Multi-Modal Query Refinement

Authors

  • GuangHao Meng Tsinghua University Pengcheng Laboratory
  • Jinpeng Wang Harbin Institute of Technology, Shenzhen
  • Qian-Wei Wang Tsinghua University
  • XuDong Ren Tsinghua University
  • Dan Zhao Pengcheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i10.37742

Abstract

Vision-Language Retrieval (VLR) aims to retrieve relevant visual or textual information from multimodal data using language or image queries. However, traditional VLR methods often rely on data-driven shallow semantic alignment and fail to understand the deeper structural and fine-grained entity features of queries, resulting in poor performance on multi-entity layouts and challenging entities. In this paper, we propose the Layout-Aware and Sketch-Enhanced (LASE) VLR framework, which refines query representations by incorporating multimodal layout and sketch knowledge. Specifically, layout knowledge encodes the spatial arrangement of entities, while sketch knowledge refines entity perception by capturing essential structural details. To extract these knowledge representations, we leverage the powerful semantic understanding of Large Language Models (LLMs) for layout generation and the fine-grained cross-modal generative capabilities of Diffusion Models (DMs) for sketch generation. However, integrating knowledge into queries may introduce biases, and queries differ in the knowledge they need because their visual content and knowledge demands vary. To address this, we propose the Gated Dual-Stream Knowledge Module (GDKM), which consists of a multi-instance fusion network and a sample-aware gating network. The fusion network aggregates diverse knowledge using multi-head attention to reduce bias, while the gating network adjusts knowledge weights based on query characteristics. Extensive experiments demonstrate that LASE significantly enhances VLR performance across multiple benchmarks, with superior generalization and transferability.
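
The gated dual-stream idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no equations, so the function names (`multi_head_pool`, `gate`, `refine_query`), the softmax gate over the two streams, the residual addition to the query, and the omission of learned projection matrices are all assumptions made for illustration. It only shows the general pattern of pooling several knowledge instances per stream with multi-head attention and then weighting the two streams by a query-dependent gate.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(query, instances):
    """Scaled dot-product attention: the query attends over knowledge instances."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, inst) / scale for inst in instances])
    return [
        sum(w * inst[i] for w, inst in zip(weights, instances))
        for i in range(len(query))
    ]


def multi_head_pool(query, instances, num_heads):
    """Split features into heads, pool each head separately, then concatenate.

    Assumption: learned per-head projections are omitted to keep the sketch short.
    """
    dim = len(query)
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    pooled = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        pooled.extend(attention_pool(query[part], [inst[part] for inst in instances]))
    return pooled


def gate(query, gate_weights):
    """Hypothetical sample-aware gate: one logit per stream, computed from the query."""
    return softmax([dot(query, w) for w in gate_weights])


def refine_query(query, layout_instances, sketch_instances, gate_weights, num_heads=2):
    """Fuse layout and sketch knowledge into the query (residual form is an assumption)."""
    layout = multi_head_pool(query, layout_instances, num_heads)
    sketch = multi_head_pool(query, sketch_instances, num_heads)
    g_layout, g_sketch = gate(query, gate_weights)
    return [q + g_layout * l + g_sketch * s for q, l, s in zip(query, layout, sketch)]


# Toy usage with made-up 4-dimensional features.
query = [1.0, 0.0, 0.0, 1.0]
layout_instances = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
sketch_instances = [[0.0, 0.0, 0.0, 0.0]]
gate_weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # zero logits: equal gates
refined = refine_query(query, layout_instances, sketch_instances, gate_weights)
```

In this toy setup the gate logits are zero, so both streams receive equal weight; with learned gate weights, a query whose content depends more on spatial arrangement could shift weight toward the layout stream.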

Published

2026-03-14

How to Cite

Meng, G., Wang, J., Wang, Q.-W., Ren, X., & Zhao, D. (2026). Imagine with Layout and Sketch: Enhancing Vision-Language Retrieval with Dual-Stream Multi-Modal Query Refinement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 7972-7980. https://doi.org/10.1609/aaai.v40i10.37742

Section

AAAI Technical Track on Computer Vision VII