i-Code: An Integrative and Composable Multimodal Learning Framework

Authors

  • Ziyi Yang, Microsoft
  • Yuwei Fang, Microsoft
  • Chenguang Zhu, Microsoft
  • Reid Pryzant, Microsoft
  • DongDong Chen, Microsoft
  • Yu Shi, Microsoft
  • Yichong Xu, Microsoft
  • Yao Qian, Microsoft
  • Mei Gao, Microsoft
  • Yi-Ling Chen, Microsoft
  • Liyang Lu, Microsoft
  • Yujia Xie, Microsoft
  • Robert Gmyr, Microsoft
  • Noel Codella, Microsoft
  • Naoyuki Kanda, Microsoft
  • Bin Xiao, Microsoft
  • Lu Yuan, Microsoft
  • Takuya Yoshioka, Microsoft
  • Michael Zeng, Microsoft
  • Xuedong Huang, Microsoft

DOI:

https://doi.org/10.1609/aaai.v37i9.26290

Keywords:

ML: Multimodal Learning, ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning

Abstract

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel merge- and co-attention mechanisms to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five multimodal understanding tasks and single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
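The abstract outlines the pipeline: pretrained single-modality encoders feed a multimodal fusion network, and the stack is trained with objectives including cross-modality contrastive learning. The following is a minimal, hypothetical PyTorch sketch of that flow. The class names, dimensions, simplified merge-attention (plain self-attention over concatenated modality sequences), and the InfoNCE-style contrastive loss are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an i-Code-style fusion step. All module names, dimensions,
# and the simplified attention layout are assumptions for illustration; the
# paper's actual encoders, merge-/co-attention design, and losses differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLayer(nn.Module):
    """One fusion block: self-attention over the concatenated (merged)
    modality sequences, a rough stand-in for the paper's merge-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ff(x))


class ICodeStyleFusion(nn.Module):
    """Projects any subset of {vision, language, speech} encoder outputs into
    a shared space and fuses them; absent modalities are simply omitted."""

    def __init__(self, dims: dict, fused_dim: int = 768, layers: int = 2):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, fused_dim)
                                   for m, d in dims.items()})
        self.fusion = nn.ModuleList(FusionLayer(fused_dim)
                                    for _ in range(layers))

    def forward(self, feats: dict) -> torch.Tensor:
        # feats: modality name -> (batch, seq_len, dim) encoder outputs.
        merged = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        for layer in self.fusion:
            merged = layer(merged)
        return merged.mean(dim=1)  # pooled multimodal representation


def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temp: float = 0.07) -> torch.Tensor:
    """InfoNCE-style cross-modality contrastive loss: paired (a_i, b_i)
    embeddings are positives, all other in-batch pairs are negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    # Toy usage: fuse dual-modality (vision + language) inputs, then align
    # pooled vision and language embeddings with the contrastive objective.
    model = ICodeStyleFusion({"vision": 1024, "language": 768, "speech": 512})
    vis = torch.randn(4, 50, 1024)  # e.g., patch features from a vision encoder
    txt = torch.randn(4, 32, 768)   # e.g., token features from a language encoder
    fused = model({"vision": vis, "language": txt})  # shape: (4, 768)
    loss = contrastive_loss(model.proj["vision"](vis).mean(1),
                            model.proj["language"](txt).mean(1))
    print(fused.shape, loss.item())
```

Dropping a modality here only changes which keys appear in the input dict, which mirrors the composability the abstract describes: single-, dual-, and triple-modality inputs all land in one representation space.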

Published

2023-06-26

How to Cite

Yang, Z., Fang, Y., Zhu, C., Pryzant, R., Chen, D., Shi, Y., Xu, Y., Qian, Y., Gao, M., Chen, Y.-L., Lu, L., Xie, Y., Gmyr, R., Codella, N., Kanda, N., Xiao, B., Yuan, L., Yoshioka, T., Zeng, M., & Huang, X. (2023). i-Code: An Integrative and Composable Multimodal Learning Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 10880-10890. https://doi.org/10.1609/aaai.v37i9.26290

Section

AAAI Technical Track on Machine Learning IV