i-Code: An Integrative and Composable Multimodal Learning Framework

Ziyi Yang; Yuwei Fang; Chenguang Zhu; Reid Pryzant; DongDong Chen; Yu Shi; Yichong Xu; Yao Qian; Mei Gao; Yi-Ling Chen; Liyang Lu; Yujia Xie; Robert Gmyr; Noel Codella; Naoyuki Kanda; Bin Xiao; Lu Yuan; Takuya Yoshioka; Michael Zeng; Xuedong Huang

doi:10.1609/aaai.v37i9.26290

Authors

Ziyi Yang (Microsoft)
Yuwei Fang (Microsoft)
Chenguang Zhu (Microsoft)
Reid Pryzant (Microsoft)
DongDong Chen (Microsoft)
Yu Shi (Microsoft)
Yichong Xu (Microsoft)
Yao Qian (Microsoft)
Mei Gao (Microsoft)
Yi-Ling Chen (Microsoft)
Liyang Lu (Microsoft)
Yujia Xie (Microsoft)
Robert Gmyr (Microsoft)
Noel Codella (Microsoft)
Naoyuki Kanda (Microsoft)
Bin Xiao (Microsoft)
Lu Yuan (Microsoft)
Takuya Yoshioka (Microsoft)
Michael Zeng (Microsoft)
Xuedong Huang (Microsoft)

DOI:

https://doi.org/10.1609/aaai.v37i9.26290

Keywords:

ML: Multimodal Learning, ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning

Abstract

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel merge- and co-attention mechanisms to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five multimodal understanding tasks and single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
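The abstract names two ideas that a small sketch can make concrete: merging the outputs of whichever modalities are present, and a cross-modality contrastive objective. The NumPy code below is an illustration only, not the authors' implementation. It assumes that merging means concatenating per-modality token sequences along the sequence axis, and that the contrastive objective is a symmetric InfoNCE loss over pooled embeddings. The function names `merge_tokens` and `contrastive_loss`, the temperature of 0.07, and all array shapes are hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def merge_tokens(modality_tokens):
    """Concatenate token sequences from any non-empty subset of modalities.

    modality_tokens: dict mapping a modality name (hypothetical keys such as
    "vision", "speech", "language") to an array of shape (seq_len, dim).
    Assumption: every encoder output has already been projected to the same dim.
    """
    if not modality_tokens:
        raise ValueError("at least one modality is required")
    # Sort the keys so the merged order does not depend on insertion order.
    return np.concatenate(
        [modality_tokens[name] for name in sorted(modality_tokens)], axis=0
    )


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of pooled embeddings.

    a, b: arrays of shape (batch, dim); row i of a is paired with row i of b.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Subtract the row maximum for numerical stability before the softmax.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = {
        "vision": rng.normal(size=(3, 8)),
        "speech": rng.normal(size=(5, 8)),
        "language": rng.normal(size=(2, 8)),
    }
    print(merge_tokens(tokens).shape)  # triple-modality input
    print(merge_tokens({"language": tokens["language"]}).shape)  # single modality

    emb = rng.normal(size=(4, 8))
    print(contrastive_loss(emb, emb))        # correctly paired batches
    print(contrastive_loss(emb, emb[::-1]))  # deliberately mispaired batches
```

Because `merge_tokens` accepts any non-empty subset of modalities, the same code path handles single, dual, and triple-modality inputs, which mirrors the composability the abstract describes. The loss is lower when row i of one batch truly corresponds to row i of the other, which is the signal that pulls paired modalities toward a shared representation space.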
