Accommodating Audio Modality in CLIP for Multimodal Processing

Authors

  • Ludan Ruan Renmin University of China
  • Anwen Hu Renmin University of China
  • Yuqing Song Renmin University of China
  • Liang Zhang Renmin University of China
  • Sipeng Zheng Renmin University of China
  • Qin Jin Renmin University of China

DOI:

https://doi.org/10.1609/aaai.v37i8.26153

Keywords:

ML: Multimodal Learning, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis, DMKM: Mining of Visual, Multimedia & Multimodal Data

Abstract

Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and the other modalities, in addition to the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different types of audio information for different scenarios, since general audio conveys both verbal and nonverbal heterogeneous information. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and Audiocaps. The corresponding code and checkpoints will be released at https://github.com/ludanruan/CLIP4VLA.
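
To illustrate the contrastive learning mentioned in the abstract, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. It is not the CLIP4VLA implementation: the function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for this example. For the inter-modal case the two inputs would be audio embeddings and the embeddings of their paired video or text; for the intra-modal case one plausible reading, again an assumption, is to pass embeddings of two views of the same audio clips.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(a_embs, b_embs, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE loss for paired embeddings.

    a_embs[i] and b_embs[i] form a positive pair; every other pairing in
    the batch is treated as a negative. The temperature is an assumed value.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    # Cosine similarity between every a and every b, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text_aligned = [[1.0, 0.0], [0.0, 1.0]]
    text_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(audio, text_aligned))
    print(symmetric_contrastive_loss(audio, text_shuffled))
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the shuffled ones, which is the behavior a contrastive objective is meant to reward.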

Published

2023-06-26

How to Cite

Ruan, L., Hu, A., Song, Y., Zhang, L., Zheng, S., & Jin, Q. (2023). Accommodating Audio Modality in CLIP for Multimodal Processing. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8), 9641-9649. https://doi.org/10.1609/aaai.v37i8.26153

Section

AAAI Technical Track on Machine Learning III