Look Around Before Locating: Considering Content and Structure Information for Visual Grounding

Authors

  • Shiyi Zheng School of Electrical Engineering, Guangxi University, Nanning, China
  • Peizhi Zhao School of Electrical Engineering, Guangxi University, Nanning, China
  • Zhilong Zheng School of Electrical Engineering, Guangxi University, Nanning, China
  • Peihang He School of Electrical Engineering, Guangxi University, Nanning, China
  • Haonan Cheng State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
  • Yi Cai Key Laboratory of Big Data and Intelligent Robot of Ministry of Education, SCUT, Guangzhou, China; School of Software Engineering, South China University of Technology, Guangzhou, China
  • Qingbao Huang School of Electrical Engineering, Guangxi University, Nanning, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, China

DOI:

https://doi.org/10.1609/aaai.v39i2.32158

Abstract

As a long-standing challenge and fundamental requirement in vision-and-language tasks, visual grounding aims to localize the target referred to by a natural language query. Regional annotations form a superficial correlation between the subject of an expression and certain common visual entities, which hinders models from comprehending linguistic content and structure. Moreover, current one-stage methods struggle to uniformly model visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding that gradually comprehends linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to obtain grounding modulated by linguistic structure. Through a soft split mechanism, our method destructures the expression into a fixed semi-structure (i.e., subject and context) while preserving the completeness of the linguistic content. Our method is thus able to build a semi-structured reasoning system that effectively comprehends linguistic content and structure through content alignment and structure-modulated grounding. Experimental results on five widely used datasets validate the performance improvements of our proposed method.
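The paper's soft split mechanism is not specified on this page, but the idea of destructuring an expression into a subject part and a context part without discarding tokens can be illustrated with a minimal sketch. Everything below is hypothetical: the token embeddings are random stand-ins for a language encoder's output, and the sigmoid-weighted pooling is one plausible way to realize a "soft" (rather than hard) split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in token embeddings for a 7-token query such as
# "the man next to the red car" (7 tokens, 8-dim); in a real model
# these would come from a language encoder.
tokens = rng.standard_normal((7, 8))

# Stand-in scoring vector (would be learned) rating how
# "subject-like" each token is.
w = rng.standard_normal(8)

def soft_split(tokens, w):
    """Softly destructure an expression into subject and context parts.

    Each token receives a weight in (0, 1); the subject representation
    pools tokens by that weight and the context by its complement, so
    every token contributes to both parts and no content is dropped.
    """
    scores = tokens @ w
    alpha = 1.0 / (1.0 + np.exp(-scores))  # per-token subject weight
    subject = (alpha[:, None] * tokens).sum(0) / alpha.sum()
    context = ((1 - alpha)[:, None] * tokens).sum(0) / (1 - alpha).sum()
    return subject, context

subject, context = soft_split(tokens, w)
print(subject.shape, context.shape)  # (8,) (8,)
```

Because the split is soft, the subject and context vectors are complementary weighted averages of the same token set, which matches the abstract's claim that completeness of the linguistic content is preserved.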

Published

2025-04-11

How to Cite

Zheng, S., Zhao, P., Zheng, Z., He, P., Cheng, H., Cai, Y., & Huang, Q. (2025). Look Around Before Locating: Considering Content and Structure Information for Visual Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1656–1664. https://doi.org/10.1609/aaai.v39i2.32158

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems