DisCo DETR: Distance-aware Multi-view Contrastive Learning for DETR Pre-training
DOI:
https://doi.org/10.1609/aaai.v40i29.39650
Abstract
Recent self-supervised pre-training methods for object detection often rely on generic object proposals for localization and on semantic feature learning for classification, but they yield limited improvements when applied to Detection Transformers (DETR) because they are not aligned with the architecture. We therefore propose Distance-aware Multi-view Contrastive Learning (DisCo DETR), a versatile self-supervised framework tailored to DETR-like models. DisCo DETR enhances localization and semantic features through two core components. (i) Distance-aware Multi-view Object Query Fusion explicitly guides object queries to focus on spatially close objects across views, stabilizing training and improving localization accuracy. (ii) Contrastive Learning for DETR uses DETR's native bipartite matching to identify positive output pairs across views and pull them closer, enhancing semantic feature discrimination without an extra matching step. DisCo DETR can be seamlessly integrated into DETR-like models and achieves state-of-the-art transfer performance on the PASCAL VOC and COCO benchmarks across multiple DETR variants.
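To make component (ii) concrete, the following is a minimal sketch of a cross-view contrastive objective, not the authors' implementation. The abstract does not specify the loss, so a generic InfoNCE form is assumed here. The names `positive_pairs`, `info_nce`, the `temperature` value, and the idea that each view's matching is available as a mapping from a target id to a query index are all hypothetical and chosen for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def positive_pairs(assign_a, assign_b):
    """Pair query indices that were matched to the same target in both views.

    assign_a, assign_b: dicts mapping a target id to the query index that a
    bipartite matching assigned to it in view A and view B (assumed format).
    """
    return [(assign_a[t], assign_b[t]) for t in sorted(assign_a) if t in assign_b]


def info_nce(view_a, view_b, pairs, temperature=0.1):
    """Mean InfoNCE loss over positive (i, j) pairs.

    view_a, view_b: lists of embedding vectors, one per query output.
    For each pair, output j of view B is the positive for output i of view A,
    and all other outputs of view B act as negatives.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, j in pairs:
        # Cosine similarity of query i (view A) to every output of view B.
        logits = [sum(x * y for x, y in zip(a[i], bk)) / temperature for bk in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[j]
    return total / len(pairs)


# Toy usage: two queries per view, matched to the same two targets.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.0], [0.0, 1.0]]
pairs = positive_pairs({"t0": 0, "t1": 1}, {"t0": 0, "t1": 1})
aligned_loss = info_nce(view_a, view_b, pairs)
swapped_loss = info_nce(view_a, view_b, [(0, 1), (1, 0)])
```

In this toy setup the loss is small when the paired outputs already agree across views and large when the pairing points at dissimilar outputs, which is the "pull positives closer" behavior the abstract describes. Reusing assignments that the detector already computes is what would avoid a separate matching step under these assumptions.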
Published
2026-03-14
How to Cite
Ouyang, C., Bai, Y., Zhang, J., Gao, T., Hao, J., Kong, L., & Gao, D. W. (2026). DisCo DETR: Distance-aware Multi-view Contrastive Learning for DETR Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 40(29), 24656–24664. https://doi.org/10.1609/aaai.v40i29.39650
Issue
Section
AAAI Technical Track on Machine Learning VI