DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes

Authors

  • Yang Liu (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
  • Feng Hou (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
  • Yunjie Peng (Beihang University)
  • Gangjian Zhang (AI Lab, Lenovo Research)
  • Yao Zhang (AI Lab, Lenovo Research)
  • Dong Xie (AI Lab, Lenovo Research)
  • Peng Wang (AI Lab, Lenovo Research)
  • Yang Zhang (AI Lab, Lenovo Research)
  • Jiang Tian (AI Lab, Lenovo Research)
  • Zhongchao Shi (AI Lab, Lenovo Research)
  • Jianping Fan (AI Lab, Lenovo Research)
  • Zhiqiang He (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Lenovo Ltd.)

DOI:

https://doi.org/10.1609/aaai.v39i6.32603

Abstract

Recent advances in vision-language pre-training have significantly enhanced model capabilities in grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when the model is faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge semantics of different granularities and uses specific attributes to identify discrepancies between objects. DoGA incorporates three principal components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations derived from image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training, which adapts the model to many-to-many assignments and facilitates simultaneous training and inference with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize across multi-granular prompts and surpass previous state-of-the-art approaches, yielding 50.2 on the COCO and 38.0 on the LVIS benchmarks under the zero-shot setting. We will make our code publicly available upon acceptance.
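The idea of attribute-based prompts with hard negatives can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the prompt templates, so the `PromptGroup` container, the `build_prompt_group` function, the template strings, and the example attributes are all hypothetical and chosen only to show how one category might yield several synonymous positive prompts plus negatives built from attributes not observed in the image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptGroup:
    """Hypothetical container: all prompts tied to one category in one image."""
    category: str
    positives: List[str]  # synonymous prompts, coarse to fine
    negatives: List[str]  # hard negatives that contradict the image


def build_prompt_group(
    category: str,
    definition: str,
    common_attributes: List[str],
    image_attributes: List[str],
    candidate_attributes: List[str],
) -> PromptGroup:
    # Positive synonyms: the plain name, the name with its definition,
    # and the name enriched with commonly visible attributes.
    # The template strings are assumptions, not taken from the paper.
    positives = [
        category,
        f"{category}: {definition}",
        f"{category} with " + " and ".join(common_attributes),
    ]
    # Hard negatives: the same category paired with an attribute
    # that is not observed in this particular image.
    observed = set(image_attributes)
    negatives = [
        f"{category} that is {attr}"
        for attr in candidate_attributes
        if attr not in observed
    ]
    return PromptGroup(category, positives, negatives)


group = build_prompt_group(
    category="zebra",
    definition="an African wild horse",
    common_attributes=["black-and-white stripes", "a short mane"],
    image_attributes=["standing"],
    candidate_attributes=["standing", "lying down", "running"],
)
```

In this sketch every positive prompt in a group refers to the same objects, which is the situation the many-to-many assignment in prompt-wise grouped training is meant to handle; how the actual model matches prompts to boxes is not specified in the abstract.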

Published

2025-04-11

How to Cite

Liu, Y., Hou, F., Peng, Y., Zhang, G., Zhang, Y., Xie, D., … He, Z. (2025). DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5658–5666. https://doi.org/10.1609/aaai.v39i6.32603

Issue

Vol. 39 No. 6 (2025)

Section

AAAI Technical Track on Computer Vision V