DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Authors

  • Shilong Liu, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University; International Digital Economy Academy (IDEA)
  • Shijia Huang, The Chinese University of Hong Kong
  • Feng Li, International Digital Economy Academy (IDEA); The Hong Kong University of Science and Technology
  • Hao Zhang, International Digital Economy Academy (IDEA); The Hong Kong University of Science and Technology
  • Yaoyuan Liang, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
  • Hang Su, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
  • Jun Zhu, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
  • Lei Zhang, International Digital Economy Academy (IDEA)

DOI:

https://doi.org/10.1609/aaai.v37i2.25261

Keywords:

CV: Multi-modal Vision, CV: Object Detection & Categorization, CV: Representation Learning for Vision

Abstract

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from the text and locate objects in the image simultaneously, which is a more practical setting for real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from the image and the text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have a shared positional part but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves 91.04% and 83.51% recall on RefCOCO testA and testB, respectively.
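
To make the dual-query design concrete, here is a minimal PyTorch sketch (not the authors' implementation): each pair of queries shares one learned positional embedding but keeps separate content embeddings for the image (box) branch and the text (phrase mask) branch. The query count and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualQueries(nn.Module):
    """Sketch of dual queries: shared positional part, separate content parts."""
    def __init__(self, num_queries: int = 100, dim: int = 256):  # illustrative sizes
        super().__init__()
        # Shared positional part that ties each object query to its phrase query.
        self.pos = nn.Embedding(num_queries, dim)
        # Separate content parts for the visual and textual branches.
        self.content_img = nn.Embedding(num_queries, dim)
        self.content_txt = nn.Embedding(num_queries, dim)

    def forward(self, batch_size: int):
        pos = self.pos.weight.unsqueeze(0).expand(batch_size, -1, -1)
        q_img = self.content_img.weight.unsqueeze(0).expand(batch_size, -1, -1) + pos
        q_txt = self.content_txt.weight.unsqueeze(0).expand(batch_size, -1, -1) + pos
        # q_img would drive box prediction, q_txt phrase-mask prediction in the decoder.
        return q_img, q_txt

queries = DualQueries()
q_img, q_txt = queries(batch_size=2)
print(q_img.shape, q_txt.shape)  # torch.Size([2, 100, 256]) each
```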

Published

2023-06-26

How to Cite

Liu, S., Huang, S., Li, F., Zhang, H., Liang, Y., Su, H., Zhu, J., & Zhang, L. (2023). DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 1728-1736. https://doi.org/10.1609/aaai.v37i2.25261

Section

AAAI Technical Track on Computer Vision II