Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Authors

  • Yu Qi Tsinghua University
  • Fan Yang SenseTime Research
  • Yousong Zhu Institute of Automation, Chinese Academy of Sciences
  • Yufei Liu Tsinghua University
  • Liwei Wu SenseTime Research
  • Rui Zhao SenseTime Research; Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China
  • Wei Li SenseTime Research

DOI:

https://doi.org/10.1609/aaai.v37i2.25300

Keywords:

CV: Representation Learning for Vision

Abstract

Autoregressive language modeling (ALM) has been successfully used for self-supervised pre-training in natural language processing (NLP). However, this paradigm has not achieved results comparable to other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we investigate why autoregressive modeling does not work well on vision tasks. To tackle this problem, we analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs. First, we serialize the image into patches. Second, we employ a stochastic permutation strategy to generate an effective and robust image context, which is critical for vision tasks. To realize this, we create a parallel encoder-decoder training process in which the encoder plays a role similar to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance on downstream tasks also shows that our model is competitive. Code is available at https://github.com/qiy20/SAIM.
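
The two designs named in the abstract (patch serialization and a stochastic prediction order) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patchify` and `permutation_masks`, the 16-pixel patch size, and the two-mask layout (one mask that includes the current patch, one that excludes it) are assumptions made for illustration, loosely following how permuted autoregressive models are commonly set up. The linked repository is the reference for the actual method.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a row-major sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def permutation_masks(num_patches, rng):
    """Sample a random prediction order and derive boolean attention masks.

    mask[i, j] is True when patch i is allowed to attend to patch j.
    """
    order = rng.permutation(num_patches)   # order[k] = patch predicted at step k
    rank = np.empty(num_patches, dtype=np.int64)
    rank[order] = np.arange(num_patches)   # rank[i] = step at which patch i is predicted
    # Assumed "content" view: a patch sees itself and everything earlier in the order.
    content_mask = rank[None, :] <= rank[:, None]
    # Assumed "prediction" view: a patch sees only strictly earlier patches.
    query_mask = rank[None, :] < rank[:, None]
    return order, content_mask, query_mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real image
patches = patchify(image, 16)              # 14 x 14 = 196 patches of 16*16*3 values
order, content_mask, query_mask = permutation_masks(len(patches), rng)
```

Because a fresh order is drawn for every call, each patch is predicted from a different subset of the image across training steps, which is one way to read the abstract's "stochastic permutation strategy".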

Published

2023-06-26

How to Cite

Qi, Y., Yang, F., Zhu, Y., Liu, Y., Wu, L., Zhao, R., & Li, W. (2023). Exploring Stochastic Autoregressive Image Modeling for Visual Representation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2074-2081. https://doi.org/10.1609/aaai.v37i2.25300

Section

AAAI Technical Track on Computer Vision II