DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Authors

  • Shilong Liu, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University; International Digital Economy Academy (IDEA)
  • Shijia Huang, The Chinese University of Hong Kong
  • Feng Li, International Digital Economy Academy (IDEA); The Hong Kong University of Science and Technology
  • Hao Zhang, International Digital Economy Academy (IDEA); The Hong Kong University of Science and Technology
  • Yaoyuan Liang, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
  • Hang Su, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
  • Jun Zhu, Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
  • Lei Zhang, International Digital Economy Academy (IDEA)

DOI:

https://doi.org/10.1609/aaai.v37i2.25261

Keywords:

CV: Multi-modal Vision, CV: Object Detection & Categorization, CV: Representation Learning for Vision

Abstract

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from the text and locate objects in the image simultaneously, which is a more practical setting for real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from the image and the text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have a shared positional part but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves 91.04% and 83.51% recall on RefCOCO testA and testB, respectively.
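
To make the dual-query design concrete, here is a minimal PyTorch sketch (not the authors' implementation): each pair of queries shares one learned positional embedding but keeps separate content embeddings for the image (box) branch and the text (phrase mask) branch. The query count and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualQueries(nn.Module):
    """Sketch of dual queries: shared positional part, separate content parts."""
    def __init__(self, num_queries: int = 100, dim: int = 256):  # illustrative sizes
        super().__init__()
        # Shared positional part that ties each object query to its phrase query.
        self.pos = nn.Embedding(num_queries, dim)
        # Separate content parts for the visual and textual branches.
        self.content_img = nn.Embedding(num_queries, dim)
        self.content_txt = nn.Embedding(num_queries, dim)

    def forward(self, batch_size: int):
        pos = self.pos.weight.unsqueeze(0).expand(batch_size, -1, -1)
        q_img = self.content_img.weight.unsqueeze(0).expand(batch_size, -1, -1) + pos
        q_txt = self.content_txt.weight.unsqueeze(0).expand(batch_size, -1, -1) + pos
        # q_img would drive box prediction, q_txt phrase-mask prediction in the decoder.
        return q_img, q_txt

queries = DualQueries()
q_img, q_txt = queries(batch_size=2)
print(q_img.shape, q_txt.shape)  # torch.Size([2, 100, 256]) each
```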

Published

2023-06-26

How to Cite

Liu, S., Huang, S., Li, F., Zhang, H., Liang, Y., Su, H., Zhu, J., & Zhang, L. (2023). DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 1728-1736. https://doi.org/10.1609/aaai.v37i2.25261

Section

AAAI Technical Track on Computer Vision II