EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Authors

  • Junyi Chen, Sun Yat-sen University
  • Longteng Guo, Institute of Automation, Chinese Academy of Sciences (CASIA)
  • Jia Sun, Bytedance Inc
  • Shuai Shao, Bytedance Inc
  • Zehuan Yuan, Bytedance Inc
  • Liang Lin, Sun Yat-sen University
  • Dongyu Zhang, Sun Yat-sen University

DOI:

https://doi.org/10.1609/aaai.v38i2.27872

Keywords:

CV: Language and Vision, ML: Multimodal Learning, CV: Multi-modal Vision

Abstract

Building scalable vision-language models that learn from diverse, multimodal data remains an open challenge. In this paper, we introduce EVE, an Efficient Vision-languagE foundation model: a single unified multimodal Transformer pre-trained with a single unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify the pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs, reconstructing the masked signals (image pixels and text tokens) from the visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to a model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and the unified pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
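
The masked signal modeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the patch count, token count, mask ratios, and the use of mean squared error on masked pixels are assumptions made for illustration, and the helper names `split_visible_masked` and `masked_pixel_mse` are hypothetical. Only the pixel side of the objective is shown; the abstract does not specify the loss used for text tokens.

```python
import random


def split_visible_masked(num_items, mask_ratio, rng):
    """Randomly split positions 0..num_items-1 into (visible, masked) index lists."""
    num_masked = int(round(num_items * mask_ratio))
    order = list(range(num_items))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_pixel_mse(pred_patches, true_patches, masked):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred_patches[i], true_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)

# Hypothetical sizes and ratios, chosen only for illustration.
img_visible, img_masked = split_visible_masked(196, 0.75, rng)  # image patches
txt_visible, txt_masked = split_visible_masked(32, 0.5, rng)    # text tokens

# Toy reconstruction: two patches of two pixels each, only patch 0 is masked.
toy_loss = masked_pixel_mse([[1.0, 1.0], [5.0, 5.0]], [[0.0, 0.0], [0.0, 0.0]], [0])
```

In this sketch the model would see only the visible positions of both modalities and be trained to predict the content at the masked positions, so the error on visible patches (patch 1 in the toy example) does not contribute to the loss.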

Published

2024-03-24

How to Cite

Chen, J., Guo, L., Sun, J., Shao, S., Yuan, Z., Lin, L., & Zhang, D. (2024). EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1110–1119. https://doi.org/10.1609/aaai.v38i2.27872

Issue

Vol. 38 No. 2 (2024)

Section

AAAI Technical Track on Computer Vision I