CF-DETR: Coarse-to-Fine Transformers for End-to-End Object Detection

Authors

  • Xipeng Cao, Beijing University of Posts and Telecommunications
  • Peng Yuan, Huawei Noah’s Ark Lab
  • Bailan Feng, Huawei Noah’s Ark Lab
  • Kun Niu, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v36i1.19893

Keywords:

Computer Vision (CV)

Abstract

The recently proposed DEtection TRansformer (DETR) achieves promising performance for end-to-end object detection. However, it has relatively low detection performance on small objects and suffers from slow convergence. We observe that DETR performs surprisingly well even on small objects when Average Precision (AP) is measured at decreased Intersection-over-Union (IoU) thresholds. Motivated by this observation, we propose a simple way to improve DETR by refining the coarse features and predicted locations. Specifically, we propose a novel Coarse-to-Fine (CF) decoder layer consisting of a coarse layer and a carefully designed fine layer. Within each CF decoder layer, the extracted local information (region-of-interest features) is introduced into the flow of global context information from the coarse layer to refine and enrich the object query features via the fine layer. In the fine layer, the multi-scale information is explored and exploited via the Adaptive Scale Fusion (ASF) module and the Local Cross-Attention (LCA) module. The multi-scale information can also be enhanced by another proposed module, the Transformer Enhanced FPN (TEF), to further improve performance. With our proposed framework (named CF-DETR), the localization accuracy of objects, especially small objects, can be largely improved. As a byproduct, the slow convergence issue of DETR can also be addressed. The effectiveness of CF-DETR is validated via extensive experiments on the COCO benchmark. CF-DETR achieves state-of-the-art performance among end-to-end detectors, e.g., 47.8 AP using ResNet-50 with the standard 3x training schedule (36 epochs).
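
The observation about decreased IoU thresholds can be illustrated with a minimal sketch. It is not taken from the paper: the function names `iou` and `is_match` and the box coordinates are hypothetical, chosen only to show how a coarsely localized prediction of a small object can count as a correct detection at a loose IoU threshold but not at a stricter one.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """A prediction counts as a true positive if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


# Hypothetical 20x20 ground-truth box and a prediction shifted by 5 pixels.
gt_box = (10.0, 10.0, 30.0, 30.0)
coarse_pred = (15.0, 15.0, 35.0, 35.0)

print(f"IoU = {iou(coarse_pred, gt_box):.3f}")
print("match at IoU 0.3:", is_match(coarse_pred, gt_box, 0.3))
print("match at IoU 0.5:", is_match(coarse_pred, gt_box, 0.5))
```

For a small object, a shift of a few pixels removes a large share of the overlap, so the prediction is roughly right yet fails the stricter threshold. This is the kind of gap that refining the predicted locations is meant to close.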

Published

2022-06-28

How to Cite

Cao, X., Yuan, P., Feng, B., & Niu, K. (2022). CF-DETR: Coarse-to-Fine Transformers for End-to-End Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 36(1), 185-193. https://doi.org/10.1609/aaai.v36i1.19893

Section

AAAI Technical Track on Computer Vision I