Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Yu Qi; Fan Yang; Yousong Zhu; Yufei Liu; Liwei Wu; Rui Zhao; Wei Li

doi:10.1609/aaai.v37i2.25300

Authors

Yu Qi Tsinghua University
Fan Yang SenseTime Research
Yousong Zhu Institute of Automation, Chinese Academy of Sciences
Yufei Liu Tsinghua University
Liwei Wu SenseTime Research
Rui Zhao SenseTime Research; Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China
Wei Li SenseTime Research

DOI:

https://doi.org/10.1609/aaai.v37i2.25300

Keywords:

CV: Representation Learning for Vision

Abstract

Autoregressive language modeling (ALM) has been used successfully for self-supervised pre-training in natural language processing (NLP). In computer vision, however, this paradigm has not achieved results comparable to other self-supervised approaches (e.g., contrastive learning, masked image modeling). In this paper, we investigate why autoregressive modeling does not work well on vision tasks. We analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs. First, we serialize the image into patches. Second, we employ a stochastic permutation strategy to generate an effective and robust image context, which is critical for vision tasks. To realize this, we create a parallel encoder-decoder training process in which the encoder plays a role similar to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder reinforce each other. Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer results on downstream tasks also show that our model achieves competitive performance. Code is available at https://github.com/qiy20/SAIM.
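
The two designs named in the abstract, patch serialization and stochastic permutation, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names `patchify` and `permutation_masks` are hypothetical, and the split into a "content" mask and a "query" mask is an assumption borrowed from permutation-based language modeling; the abstract does not specify how the encoder and decoder are masked.

```python
import numpy as np


def patchify(image, patch_size):
    """Serialize an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, ps, ps, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def permutation_masks(num_patches, rng):
    """Sample a random prediction order and derive attention masks from it.

    mask[i, j] is True when patch i is allowed to attend to patch j.
    Both masks are illustrative assumptions, not taken from the paper.
    """
    perm = rng.permutation(num_patches)      # perm[t] = patch predicted at step t
    rank = np.empty(num_patches, dtype=np.int64)
    rank[perm] = np.arange(num_patches)      # rank[i] = step at which patch i appears
    # Content-style mask: a patch sees itself and every patch earlier in the order.
    content_mask = rank[None, :] <= rank[:, None]
    # Query-style mask: a patch sees only strictly earlier patches, so it
    # cannot look at the content it is asked to predict.
    query_mask = rank[None, :] < rank[:, None]
    return perm, content_mask, query_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3), dtype=np.float32)
    tokens = patchify(image, 16)
    perm, content_mask, query_mask = permutation_masks(tokens.shape[0], rng)
    print(tokens.shape, content_mask.shape, query_mask.shape)
```

Because a fresh permutation is drawn for every call, each patch is predicted from a different random subset of the other patches across training steps, which is one way to read the "stochastic" image context described above.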
