Look Around Before Locating: Considering Content and Structure Information for Visual Grounding

Authors

  • Shiyi Zheng School of Electrical Engineering, Guangxi University, Nanning, China
  • Peizhi Zhao School of Electrical Engineering, Guangxi University, Nanning, China
  • Zhilong Zheng School of Electrical Engineering, Guangxi University, Nanning, China
  • Peihang He School of Electrical Engineering, Guangxi University, Nanning, China
  • Haonan Cheng State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
  • Yi Cai Key Laboratory of Big Data and Intelligent Robot of Ministry of Education, SCUT, Guangzhou, China; School of Software Engineering, South China University of Technology, Guangzhou, China
  • Qingbao Huang School of Electrical Engineering, Guangxi University, Nanning, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, China

DOI:

https://doi.org/10.1609/aaai.v39i2.32158

Abstract

As a long-standing challenge and fundamental requirement in vision-and-language tasks, visual grounding aims to localize the target referred to by a natural language query. Regional annotations form a superficial correlation between the subject of an expression and certain common visual entities, which hinders models from comprehending linguistic content and structure. Moreover, current one-stage methods struggle to uniformly model visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding that gradually comprehends linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to obtain grounding modulated by linguistic structure. Through a soft split mechanism, our method destructures the expression into a fixed semi-structure (i.e., subject and context) while preserving the completeness of the linguistic content. Our method is thus able to build a semi-structured reasoning system that effectively comprehends linguistic content and structure through content alignment and structure-modulated grounding. Experimental results on five widely used datasets validate the performance improvements of our proposed method.
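The paper's soft split mechanism is not specified on this page, but the idea of destructuring an expression into a subject part and a context part without discarding tokens can be illustrated with a minimal sketch. Everything below is hypothetical: the token embeddings are random stand-ins for a language encoder's output, and the sigmoid-weighted pooling is one plausible way to realize a "soft" (rather than hard) split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in token embeddings for a 7-token query such as
# "the man next to the red car" (7 tokens, 8-dim); in a real model
# these would come from a language encoder.
tokens = rng.standard_normal((7, 8))

# Stand-in scoring vector (would be learned) rating how
# "subject-like" each token is.
w = rng.standard_normal(8)

def soft_split(tokens, w):
    """Softly destructure an expression into subject and context parts.

    Each token receives a weight in (0, 1); the subject representation
    pools tokens by that weight and the context by its complement, so
    every token contributes to both parts and no content is dropped.
    """
    scores = tokens @ w
    alpha = 1.0 / (1.0 + np.exp(-scores))  # per-token subject weight
    subject = (alpha[:, None] * tokens).sum(0) / alpha.sum()
    context = ((1 - alpha)[:, None] * tokens).sum(0) / (1 - alpha).sum()
    return subject, context

subject, context = soft_split(tokens, w)
print(subject.shape, context.shape)  # (8,) (8,)
```

Because the split is soft, the subject and context vectors are complementary weighted averages of the same token set, which matches the abstract's claim that completeness of the linguistic content is preserved.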

Published

2025-04-11

How to Cite

Zheng, S., Zhao, P., Zheng, Z., He, P., Cheng, H., Cai, Y., & Huang, Q. (2025). Look Around Before Locating: Considering Content and Structure Information for Visual Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1656–1664. https://doi.org/10.1609/aaai.v39i2.32158

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems