Exploring Stochastic Autoregressive Image Modeling for Visual Representation
DOI:
https://doi.org/10.1609/aaai.v37i2.25300
Keywords:
CV: Representation Learning for Vision
Abstract
Autoregressive language modeling (ALM) has been used successfully for self-supervised pre-training in natural language processing (NLP). However, this paradigm has not achieved results comparable to other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we investigate why autoregressive modeling does not work well on vision tasks. We analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs. First, we serialize the image into patches. Second, we employ a stochastic permutation strategy to generate an effective and robust image context, which is critical for vision tasks. To realize this, we create a parallel encoder-decoder training process in which the encoder plays a role similar to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance on downstream tasks also shows that our model is competitive. Code is available at https://github.com/qiy20/SAIM.
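The two designs named in the abstract, patch serialization and a stochastic permutation of the prediction order, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); the function names `patchify` and `stochastic_masks`, and the split into a "content" mask and a "query" mask, are assumptions made here for illustration, borrowed from the general idea of permutation-based autoregressive modeling.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def stochastic_masks(num_patches, rng):
    """Sample a random prediction order and derive boolean attention masks.

    Hypothetical helper: mask[i, j] is True when patch i may attend to patch j.
    """
    order = rng.permutation(num_patches)   # order[k] = patch predicted at step k
    rank = np.empty(num_patches, dtype=int)
    rank[order] = np.arange(num_patches)   # rank[i] = step at which patch i appears
    # Content mask: a patch sees itself and every patch earlier in the order.
    content_mask = rank[None, :] <= rank[:, None]
    # Query mask: a patch sees only strictly earlier patches, so the
    # prediction for position i cannot peek at the content of patch i.
    query_mask = rank[None, :] < rank[:, None]
    return order, content_mask, query_mask


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = patchify(image, patch_size=8)            # 16 patches of length 192
order, content_mask, query_mask = stochastic_masks(len(patches), rng)
```

Because a fresh permutation can be drawn for every training sample, each patch is predicted from many different subsets of the other patches over the course of training, rather than only from those preceding it in a fixed raster order.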
Published
2023-06-26
How to Cite
Qi, Y., Yang, F., Zhu, Y., Liu, Y., Wu, L., Zhao, R., & Li, W. (2023). Exploring Stochastic Autoregressive Image Modeling for Visual Representation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2074-2081. https://doi.org/10.1609/aaai.v37i2.25300
Issue
Section
AAAI Technical Track on Computer Vision II