DisCo DETR: Distance-aware Multi-view Contrastive Learning for DETR Pre-training

Chao Ouyang; Yuyang Bai; Jun Zhang; Tianlu Gao; Jun Hao; Lijun Kong; David Wenzhong Gao

doi:10.1609/aaai.v40i29.39650

Authors

Chao Ouyang, Wuhan University
Yuyang Bai, Wuhan University
Jun Zhang, Wuhan University
Tianlu Gao, Wuhan University
Jun Hao, China Datang Technology Innovation Co., Ltd.
Lijun Kong, China Yangtze Power Co., Ltd.
David Wenzhong Gao, Wuhan University

Abstract

Recent self-supervised pre-training methods for object detection often rely on generic object proposals for localization and on semantic feature learning for classification, but they yield limited improvements when applied to Detection Transformers (DETR) because they are not aligned with the DETR architecture. We therefore propose Distance-aware Multi-view Contrastive Learning (DisCo DETR), an elegant and versatile self-supervised framework tailored to DETR-like models. DisCo DETR enhances localization and semantic features through two core components. (i) Distance-aware Multi-view Object Query Fusion explicitly guides object queries to focus on spatially close objects across views, stabilizing training and improving localization accuracy. (ii) Contrastive Learning for DETR uses DETR's native bipartite matching to identify positive output pairs across views and pull them closer, improving the discriminability of semantic features without an extra matching step. DisCo DETR can be seamlessly integrated into DETR-like models and achieves state-of-the-art transfer performance on the PASCAL VOC and COCO benchmarks across multiple DETR variants.
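
To make component (ii) concrete, the following is a minimal sketch of how matching results from two views could be reused to form positive pairs for a contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so an InfoNCE-style objective with a temperature of 0.1 is assumed here, and the names `positive_pairs`, `info_nce`, `match_a`, and `match_b` are hypothetical. The bipartite matching itself is taken as given (each view supplies a query-to-target assignment), and the distance-aware query fusion of component (i) is not modeled.

```python
import math


def positive_pairs(match_a, match_b):
    """Pair up queries from two views that were matched to the same target.

    match_a, match_b: dict mapping query index -> target index, assumed to
    come from each view's bipartite matching (hypothetical input format).
    Returns a list of (query_in_view_a, query_in_view_b) tuples.
    """
    target_to_query_b = {t: q for q, t in match_b.items()}
    return [
        (qa, target_to_query_b[t])
        for qa, t in sorted(match_a.items())
        if t in target_to_query_b
    ]


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(emb_a, emb_b, pairs, temperature=0.1):
    """Mean InfoNCE-style loss over the positive pairs (assumed loss form).

    For each positive pair (qa, qb), every other embedding in view b acts
    as a negative for the view-a query qa.
    """
    if not pairs:
        return 0.0
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    total = 0.0
    for qa, qb in pairs:
        logits = [sum(x * y for x, y in zip(a[qa], vb)) / temperature for vb in b]
        # Numerically stable log-sum-exp over all view-b candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[qb]
    return total / len(pairs)


# Toy example: two targets, each matched by one query in each view.
match_a = {0: 1, 2: 0}  # view a: query 0 -> target 1, query 2 -> target 0
match_b = {1: 0, 3: 1}  # view b: query 1 -> target 0, query 3 -> target 1
pairs = positive_pairs(match_a, match_b)
print(pairs)  # [(0, 3), (2, 1)]
```

Because the pairs are read directly from the two assignments, no second matching pass between views is needed in this sketch, which mirrors the "no extra matching" property described above. Minimizing the loss pulls each matched pair of output embeddings together relative to the other outputs of the second view.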
