DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes

Yang Liu; Feng Hou; Yunjie Peng; Gangjian Zhang; Yao Zhang; Dong Xie; Peng Wang; Yang Zhang; Jiang Tian; Zhongchao Shi; Jianping Fan; Zhiqiang He

doi:10.1609/aaai.v39i6.32603

Authors

Yang Liu: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Feng Hou: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yunjie Peng: Beihang University
Gangjian Zhang: AI Lab, Lenovo Research
Yao Zhang: AI Lab, Lenovo Research
Dong Xie: AI Lab, Lenovo Research
Peng Wang: AI Lab, Lenovo Research
Yang Zhang: AI Lab, Lenovo Research
Jiang Tian: AI Lab, Lenovo Research
Zhongchao Shi: AI Lab, Lenovo Research
Jianping Fan: AI Lab, Lenovo Research
Zhiqiang He: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Lenovo Ltd.

DOI: https://doi.org/10.1609/aaai.v39i6.32603

Abstract

Recent advances in vision-language pre-training have significantly enhanced model capabilities in grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when the model is faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge semantics of different granularity and uses specific attributes to identify discrepancies between objects. DoGA incorporates three principal components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations derived from the image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training, which adapts the model to perform many-to-many assignments, facilitating simultaneous training and inference with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize to multi-granular prompts and surpass previous state-of-the-art approaches, yielding 50.2 on the COCO and 38.0 on the LVIS benchmarks under the zero-shot setting. We will make our code publicly available upon acceptance.
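
The abstract does not specify prompt templates or the assignment procedure, so the following is only a minimal illustrative sketch of the idea behind components 1) and 3): several synonymous attribute-based prompts are grouped per category, and each ground-truth box is marked as a positive for every synonym in its group (a many-to-many target). All names here (`PromptGroup`, `build_prompt_group`, `assignment_targets`), the prompt wording, and the example attributes are hypothetical and are not taken from the DoGA implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptGroup:
    """Synonymous prompts for one category, plus hard negatives (hypothetical structure)."""
    category: str
    positives: List[str]
    negatives: List[str] = field(default_factory=list)


def build_prompt_group(category, common_attributes, absent_attributes):
    """Build a group: the plain name, an attribute-enriched synonym, and hard negatives.

    The templates below are assumptions made for illustration only.
    """
    positives = [category]
    if common_attributes:
        positives.append(f"{category}, which has {', '.join(common_attributes)}")
    # Hard negatives mention attributes that are NOT present in the image.
    negatives = [f"{category} with {attr}" for attr in absent_attributes]
    return PromptGroup(category, positives, negatives)


def assignment_targets(groups, box_labels):
    """Return a boxes x prompts 0/1 matrix.

    A box is a positive for every synonymous prompt of its own category
    and a negative for all other prompts, including hard negatives.
    """
    columns = []
    for group in groups:
        columns += [(group.category, True)] * len(group.positives)
        columns += [(group.category, False)] * len(group.negatives)
    return [
        [1 if (cat == label and is_positive) else 0 for cat, is_positive in columns]
        for label in box_labels
    ]


groups = [
    build_prompt_group("zebra", ["black and white stripes", "four legs"], ["a rider"]),
    build_prompt_group("horse", ["a mane"], []),
]
targets = assignment_targets(groups, ["zebra", "horse"])
```

In this toy setup each box has two positive prompts (the plain category name and its attribute-enriched synonym), which is the many-to-many pattern the abstract refers to; how DoGA actually fuses and optimizes these prompts is not described at this level of detail.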
