i-Code: An Integrative and Composable Multimodal Learning Framework


  • Ziyi Yang Microsoft
  • Yuwei Fang Microsoft
  • Chenguang Zhu Microsoft
  • Reid Pryzant Microsoft
  • DongDong Chen Microsoft
  • Yu Shi Microsoft
  • Yichong Xu Microsoft
  • Yao Qian Microsoft
  • Mei Gao Microsoft
  • Yi-Ling Chen Microsoft
  • Liyang Lu Microsoft
  • Yujia Xie Microsoft
  • Robert Gmyr Microsoft
  • Noel Codella Microsoft
  • Naoyuki Kanda Microsoft
  • Bin Xiao Microsoft
  • Lu Yuan Microsoft
  • Takuya Yoshioka Microsoft
  • Michael Zeng Microsoft
  • Xuedong Huang Microsoft




Keywords: ML: Multimodal Learning, ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning


Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework in which users can flexibly combine the modalities of vision, speech, and language into unified, general-purpose vector representations. In this framework, data from each modality are first passed to pretrained single-modality encoders. The encoder outputs are then integrated by a multimodal fusion network, which uses novel merge- and co-attention mechanisms to combine information from the different modalities effectively. The entire system is pretrained end-to-end with new objectives, including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research that uses only video for pretraining, the i-Code framework can dynamically process single-, dual-, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results show that i-Code outperforms state-of-the-art techniques on five multimodal understanding tasks and on single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
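The abstract names cross-modality contrastive learning as one of the pretraining objectives but does not spell out its form. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE-style loss over paired embeddings from two modalities. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the paper; the actual i-Code objective may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired embeddings of two modalities.

    emb_a[i] and emb_b[i] are assumed to come from the same sample, so they
    form the positive pair; every other pairing in the batch is a negative.
    The temperature is an illustrative default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity between every a-row and every b-row, scaled.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


# Toy batch: two samples with 2-dimensional embeddings per modality.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(speech, text_aligned)
shuffled_loss = contrastive_loss(speech, text_shuffled)
```

In this toy batch the loss is close to zero when the paired embeddings agree and much larger when the pairs are swapped, which is the behavior a contrastive objective relies on to pull matching modalities together in a shared representation space.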




How to Cite

Yang, Z., Fang, Y., Zhu, C., Pryzant, R., Chen, D., Shi, Y., Xu, Y., Qian, Y., Gao, M., Chen, Y.-L., Lu, L., Xie, Y., Gmyr, R., Codella, N., Kanda, N., Xiao, B., Yuan, L., Yoshioka, T., Zeng, M., & Huang, X. (2023). i-Code: An Integrative and Composable Multimodal Learning Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 10880-10890. https://doi.org/10.1609/aaai.v37i9.26290



AAAI Technical Track on Machine Learning IV