DisCo DETR: Distance-aware Multi-view Contrastive Learning for DETR Pre-training

Authors

  • Chao Ouyang, Wuhan University
  • Yuyang Bai, Wuhan University
  • Jun Zhang, Wuhan University
  • Tianlu Gao, Wuhan University
  • Jun Hao, China Datang Technology Innovation Co., Ltd.
  • Lijun Kong, China Yangtze Power Co., Ltd.
  • David Wenzhong Gao, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v40i29.39650

Abstract

Recent self-supervised pre-training methods for object detection often rely on generic object proposals for localization and on semantic feature learning for classification, but they yield limited improvements when applied to Detection Transformers (DETR) because they are not aligned with the DETR architecture. Hence, we propose Distance-aware Multi-view Contrastive Learning (DisCo DETR), an elegant and versatile self-supervised framework tailored for DETR-like models. DisCo DETR enhances localization and semantic features through two core components. (i) Distance-aware Multi-view Object Query Fusion explicitly guides object queries to focus on spatially close objects across views, stabilizing training and improving localization accuracy. (ii) Contrastive Learning for DETR uses DETR's native bipartite matching to identify positive output pairs across views and pull them closer, enhancing the discriminability of semantic features without any extra matching step. DisCo DETR can be seamlessly integrated into DETR-like models and achieves state-of-the-art (SOTA) transfer performance on the PASCAL VOC and COCO benchmarks across multiple DETR variants.
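
To make component (ii) more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each view's query outputs are matched one-to-one to a shared set of pseudo-targets, that two outputs matched to the same target form a positive pair, and that positives are pulled together with an InfoNCE-style loss. The function names (`match`, `info_nce`), the L1 box cost, the temperature value, and the toy data are all hypothetical. The brute-force matcher is only for readability; a practical implementation would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import math


def match(pred_boxes, target_boxes):
    """Brute-force minimum-cost one-to-one assignment (assumed L1 box cost).

    Returns a list where entry t is the index of the query matched to target t.
    """
    def cost(p, t):
        return sum(abs(a - b) for a, b in zip(p, t))

    best, best_cost = None, math.inf
    for perm in itertools.permutations(range(len(pred_boxes)), len(target_boxes)):
        c = sum(cost(pred_boxes[q], target_boxes[t]) for t, q in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best)


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def info_nce(emb_a, emb_b, temperature=0.1):
    """InfoNCE loss where emb_a[i] and emb_b[i] are positives, the rest negatives."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    loss = 0.0
    for i, va in enumerate(a):
        logits = [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


# Toy data (hypothetical): 3 queries per view, 2 shared pseudo-targets.
# Boxes are (cx, cy, w, h); target t refers to the same object in both views.
boxes_v1 = [(0.71, 0.69, 0.2, 0.2), (0.21, 0.2, 0.1, 0.1), (0.5, 0.5, 0.3, 0.3)]
targets_v1 = [(0.2, 0.2, 0.1, 0.1), (0.7, 0.7, 0.2, 0.2)]
boxes_v2 = [(0.3, 0.3, 0.1, 0.1), (0.5, 0.1, 0.2, 0.2), (0.8, 0.6, 0.2, 0.2)]
targets_v2 = [(0.3, 0.3, 0.1, 0.1), (0.8, 0.6, 0.2, 0.2)]
emb_v1 = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
emb_v2 = [(0.9, 0.1), (-1.0, -1.0), (0.1, 0.9)]

m1 = match(boxes_v1, targets_v1)  # query index per target in view 1
m2 = match(boxes_v2, targets_v2)  # query index per target in view 2

# Outputs matched to the same target in both views form the positive pairs.
pos_a = [emb_v1[q] for q in m1]
pos_b = [emb_v2[q] for q in m2]
aligned = info_nce(pos_a, pos_b)
shuffled = info_nce(pos_a, pos_b[::-1])  # deliberately mispaired, for comparison
```

In this sketch the per-view assignments are reused to pair outputs across views, so no second matching step between views is needed, which is one plausible reading of "no extra matching" in the abstract. The loss for the correctly paired embeddings (`aligned`) comes out lower than for the deliberately mispaired ones (`shuffled`).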

Published

2026-03-14

How to Cite

Ouyang, C., Bai, Y., Zhang, J., Gao, T., Hao, J., Kong, L., & Gao, D. W. (2026). DisCo DETR: Distance-aware Multi-view Contrastive Learning for DETR Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 40(29), 24656–24664. https://doi.org/10.1609/aaai.v40i29.39650

Section

AAAI Technical Track on Machine Learning VI