DDAE: Towards Deep Dynamic Vision BERT Pretraining

Authors

  • Honghao Chen (CRISE, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Xiangwen Kong (MEGVII Technology)
  • Xiangyu Zhang (MEGVII Technology)
  • Xin Zhao (CRISE, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Kaiqi Huang (CRISE, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; CAS Center for Excellence in Brain Science and Intelligence Technology)

DOI:

https://doi.org/10.1609/aaai.v38i2.27864

Keywords:

CV: Representation Learning for Vision, ML: Unsupervised & Self-Supervised Learning

Abstract

Recently, masked image modeling (MIM) has shown promise for self-supervised representation learning. However, existing MIM frameworks treat all masked patches equally during reconstruction, ignoring that reconstruction difficulty can vary sharply from patch to patch depending on the distance from the visible patches. In this paper, we propose deep dynamic supervision, which enables MIM methods to dynamically reconstruct patches of different difficulty at different pretraining phases and model depths. Deep dynamic supervision provides more locality inductive bias for ViTs, especially in deep layers, which compensates for the lack of a local prior in the self-attention mechanism. Building on deep dynamic supervision, we propose the Deep Dynamic AutoEncoder (DDAE), a simple yet effective MIM framework that applies dynamic mechanisms to pixel regression and feature self-distillation simultaneously. Extensive experiments on a variety of vision tasks, including ImageNet classification, semantic segmentation on ADE20K, and object detection on COCO, demonstrate the effectiveness of our approach.
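
The idea that reconstruction difficulty depends on distance from visible patches can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the exponential weighting form, and the parameter beta are hypothetical choices made for illustration only; the abstract does not specify how DDAE weights patches or how the weighting changes with pretraining phase or depth.

```python
import math


def distance_to_visible(mask):
    """Distance (in patch units) from each patch to its nearest visible patch.

    mask[r][c] is True for a masked patch and False for a visible one.
    Visible patches get distance 0.0.
    """
    visible = [(r, c) for r, row in enumerate(mask)
               for c, m in enumerate(row) if not m]
    if not visible:
        raise ValueError("at least one patch must be visible")
    return [
        [min(math.hypot(r - vr, c - vc) for vr, vc in visible)
         for c in range(len(row))]
        for r, row in enumerate(mask)
    ]


def dynamic_weights(mask, beta):
    """Hypothetical per-patch loss weights (an assumption, not the paper's formula).

    Each masked patch gets a raw weight exp(beta * distance); visible patches
    get 0. Weights are rescaled so masked patches average to 1.
    beta > 0 emphasizes far patches, beta < 0 emphasizes near patches,
    and beta = 0 recovers uniform weighting over masked patches.
    """
    dist = distance_to_visible(mask)
    raw = [[math.exp(beta * d) if m else 0.0 for d, m in zip(drow, mrow)]
           for drow, mrow in zip(dist, mask)]
    n_masked = sum(m for row in mask for m in row)
    if n_masked == 0:
        raise ValueError("at least one patch must be masked")
    scale = n_masked / sum(sum(row) for row in raw)
    return [[w * scale for w in row] for row in raw]


# A 4x4 patch grid in which only the top-left patch is visible.
mask = [[True] * 4 for _ in range(4)]
mask[0][0] = False

# Two illustrative settings of the hypothetical beta parameter.
near_focused = dynamic_weights(mask, beta=-0.5)
far_focused = dynamic_weights(mask, beta=0.5)
```

In a training loop, such weights would multiply the per-patch reconstruction loss, and beta could be varied across pretraining steps or across layers. Which direction and schedule to use is left open here, since the abstract gives no details.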

Published

2024-03-24

How to Cite

Chen, H., Kong, X., Zhang, X., Zhao, X., & Huang, K. (2024). DDAE: Towards Deep Dynamic Vision BERT Pretraining. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1037-1045. https://doi.org/10.1609/aaai.v38i2.27864

Section

AAAI Technical Track on Computer Vision I