Look Around Before Locating: Considering Content and Structure Information for Visual Grounding
DOI:
https://doi.org/10.1609/aaai.v39i2.32158
Abstract
As a long-standing challenge and fundamental requirement in vision and language tasks, visual grounding aims to localize a target referred to by a natural language query. Regional annotations form a superficial correlation between the subject of the expression and some common visual entities, which hinders models from comprehending the linguistic content and structure. However, current one-stage methods struggle to uniformly model the visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding to gradually comprehend the linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to perform grounding modulated by linguistic structure. Through a soft split mechanism, our method can decompose the expression into a fixed semi-structure (i.e., subject and context) while ensuring the completeness of the linguistic content. Our method is thus capable of building a semi-structured reasoning system that effectively comprehends the linguistic content and structure through content alignment and structure-modulated grounding. Experimental results on five widely used datasets validate the performance improvements of our proposed method.
Published
2025-04-11
How to Cite
Zheng, S., Zhao, P., Zheng, Z., He, P., Cheng, H., Cai, Y., & Huang, Q. (2025). Look Around Before Locating: Considering Content and Structure Information for Visual Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1656–1664. https://doi.org/10.1609/aaai.v39i2.32158
Issue
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems