Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point

Peizhi Zhao; Shiyi Zheng; Wenye Zhao; Dongsheng Xu; Pijian Li; Yi Cai; Qingbao Huang

doi:10.1609/aaai.v38i7.28580

Authors

Peizhi Zhao School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Shiyi Zheng School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Wenye Zhao School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Dongsheng Xu School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Pijian Li School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Yi Cai School of Software Engineering, South China University of Technology; Key Laboratory of Big Data and Intelligent Robot
Qingbao Huang School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology

DOI:

https://doi.org/10.1609/aaai.v38i7.28580

Keywords:

CV: Language and Vision, NLP: Language Grounding & Multi-modal NLP

Abstract

As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, for a complex task that couples the comprehension of abstract concepts and the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources. To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at the region level, but a solution based on isolated regions cannot sufficiently utilize the context and is usually limited by the quality of proposals. Therefore, it is necessary to rebuild an efficient two-stage solution system. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores as soft ground-truth for measuring point-level cross-modal correlations. With the soft ground-truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation by utilizing the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT.
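The abstract does not give the formula for the center score, so the following is only an illustrative sketch of how a bounding box could be turned into a point-level soft ground-truth map and scored as a per-point binary classification. It assumes an FCOS-style centerness function, which is a stand-in and not necessarily the authors' definition; the names `center_score_map` and `soft_bce` are hypothetical.

```python
import numpy as np

def center_score_map(box, height, width):
    """Soft score per point: 1 at the box center, 0 on the box edges and outside.

    ASSUMPTION: FCOS-style centerness, used here only for illustration.
    box is (x0, y0, x1, y1) in pixel coordinates.
    """
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    # Signed distances from each point to the four box sides.
    left, right = xs - x0, x1 - xs
    top, bottom = ys - y0, y1 - ys
    inside = (left >= 0) & (right >= 0) & (top >= 0) & (bottom >= 0)
    eps = 1e-12  # guards against division by zero for degenerate boxes
    lr = np.minimum(left, right) / np.maximum(np.maximum(left, right), eps)
    tb = np.minimum(top, bottom) / np.maximum(np.maximum(top, bottom), eps)
    score = np.sqrt(np.clip(lr, 0.0, None) * np.clip(tb, 0.0, None))
    return np.where(inside, score, 0.0)

def soft_bce(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over points, with soft targets in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(loss))

# Toy example: an 8x8 grid with a box spanning (2, 2) to (6, 6).
target = center_score_map((2, 2, 6, 6), height=8, width=8)
loss_good = soft_bce(target, target)        # prediction equal to the target
loss_bad = soft_bce(1.0 - target, target)   # inverted prediction
```

Because every point receives its own target, the loss is computed over the whole grid rather than over a set of isolated region proposals, which is the property the abstract emphasizes. A mass score derived from a segmentation mask could be plugged into the same loss in place of `target`.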
