Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Authors

  • Peijun Bao, Nanyang Technological University
  • Wenhan Yang, Nanyang Technological University; Peng Cheng Laboratory
  • Boon Poh Ng, Nanyang Technological University
  • Meng Hwa Er, Nanyang Technological University
  • Alex C. Kot, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v37i1.25093

Keywords:

CV: Video Understanding & Activity Analysis, CV: Image and Video Retrieval, CV: Multi-modal Vision, SNLP: Speech and Multimodality

Abstract

This paper explores, for the first time, audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is labor-intensive and thus does not scale to real-world applications. To this end, we propose cross-modal label contrastive learning, which exploits multi-modal information in unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from the audio and visual components using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background hinders the acquisition of high-quality pseudo-labels and thus leads to an inferior localization model. To address this issue, we further propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.
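
The abstract does not spell out the form of the contrastive objective. As a rough illustration of the general idea of using paired audio and visual streams as mutual supervision, the sketch below implements a generic symmetric InfoNCE-style loss in plain Python. It is not the paper's label contrasting formulation: the function name `cross_modal_info_nce`, the helpers, the `temperature` value, and the assumption that row i of each input comes from the same video segment are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, index):
    """Numerically stable log-softmax, evaluated at one position."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def cross_modal_info_nce(audio, visual, temperature=0.1):
    """Generic symmetric InfoNCE loss over paired audio/visual features.

    Assumption for this sketch: audio[i] and visual[i] describe the same
    segment (the positive pair); every other pairing acts as a negative.
    """
    if not audio or len(audio) != len(visual):
        raise ValueError("audio and visual must be non-empty and equal length")
    a = [l2_normalize(row) for row in audio]
    v = [l2_normalize(row) for row in visual]
    n = len(a)
    # Temperature-scaled cosine similarity between every audio/visual pair.
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    # Audio-to-visual direction: each row should peak on its diagonal entry.
    loss_a2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Visual-to-audio direction: each column should peak on its diagonal entry.
    loss_v2a = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_a2v + loss_v2a)
```

Under these assumptions, correctly paired features yield a loss near zero, while mismatched pairs are penalized heavily; a training loop would minimize this quantity with respect to the encoders that produce the features.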

Published

2023-06-26

How to Cite

Bao, P., Yang, W., Ng, B. P., Er, M. H., & Kot, A. C. (2023). Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 215-222. https://doi.org/10.1609/aaai.v37i1.25093

Section

AAAI Technical Track on Computer Vision I