Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point

Authors

  • Peizhi Zhao, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
  • Shiyi Zheng, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
  • Wenye Zhao, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
  • Dongsheng Xu, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
  • Pijian Li, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
  • Yi Cai, School of Software Engineering, South China University of Technology; Key Laboratory of Big Data and Intelligent Robot
  • Qingbao Huang, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology

DOI:

https://doi.org/10.1609/aaai.v38i7.28580

Keywords:

CV: Language and Vision, NLP: Language Grounding & Multi-modal NLP

Abstract

As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, for a complex task that couples the comprehension of abstract concepts and the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources. To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at region-level, but the solution based on isolated regions cannot sufficiently utilize the context and is usually limited by the quality of proposals. Therefore, it is necessary to rebuild an efficient two-stage solution system. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores as soft ground-truth for measuring point-level cross-modal correlations. With the soft ground-truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation by utilizing the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT.
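
To make the idea of a soft ground-truth concrete, the sketch below turns a bounding box into a per-point score map and scores a prediction with a binary cross-entropy loss that accepts soft targets. It is only an illustration: the abstract does not define the center score, so the FCOS-style centerness formula used here is an assumption, and the function names `center_score_map` and `soft_bce` are hypothetical and not taken from the released code.

```python
import numpy as np

def center_score_map(box, height, width):
    """Soft per-point target in [0, 1] for a box (x1, y1, x2, y2).

    Assumption: FCOS-style centerness stands in for the paper's center score.
    Points outside the box get a score of 0.
    """
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    # Distances from every grid point to the four sides of the box.
    left, right = xs - x1, x2 - xs
    top, bottom = ys - y1, y2 - ys
    inside = (left >= 0) & (right >= 0) & (top >= 0) & (bottom >= 0)
    eps = 1e-9  # avoids division by zero for degenerate boxes
    ratio_x = np.minimum(left, right) / (np.maximum(left, right) + eps)
    ratio_y = np.minimum(top, bottom) / (np.maximum(top, bottom) + eps)
    score = np.sqrt(np.clip(ratio_x * ratio_y, 0.0, None))
    return np.where(inside, score, 0.0)

def soft_bce(pred, target):
    """Mean binary cross-entropy between predicted probabilities and soft targets."""
    eps = 1e-7
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

# Usage: a 9x9 grid with a box centered on point (row 4, col 4).
target = center_score_map((2, 2, 6, 6), height=9, width=9)
uniform_guess = np.full_like(target, 0.5)
loss = soft_bce(uniform_guess, target)
peak = np.unravel_index(np.argmax(target), target.shape)
```

Under this assumed definition the score peaks at the box center and falls to zero at the box border, so every point in the image receives a target and no region proposal is needed to form the loss.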

Published

2024-03-24

How to Cite

Zhao, P., Zheng, S., Zhao, W., Xu, D., Li, P., Cai, Y., & Huang, Q. (2024). Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7487-7495. https://doi.org/10.1609/aaai.v38i7.28580

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI