Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Peijun Bao; Wenhan Yang; Boon Poh Ng; Meng Hwa Er; Alex C. Kot

doi:10.1609/aaai.v37i1.25093

Authors

Peijun Bao, Nanyang Technological University
Wenhan Yang, Nanyang Technological University; Peng Cheng Laboratory
Boon Poh Ng, Nanyang Technological University
Meng Hwa Er, Nanyang Technological University
Alex C. Kot, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v37i1.25093

Keywords:

CV: Video Understanding & Activity Analysis, CV: Image and Video Retrieval, CV: Multi-modal Vision, SNLP: Speech and Multimodality

Abstract

This paper is the first to explore audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modal datasets with category annotations is labor-intensive and thus does not scale to real-world applications. To this end, we propose cross-modal label contrastive learning, which exploits multi-modal information in unlabeled audio and visual streams as a self-supervision signal. At the feature representation level, multi-modal representations are learned collaboratively from the audio and visual components through self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, label contrasting, which self-annotates videos with pseudo-labels for training the localization model. Irrelevant background, however, hinders the acquisition of high-quality pseudo-labels and thus leads to an inferior localization model. To address this issue, we further propose an expectation-maximization algorithm that optimizes pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared with state-of-the-art supervised methods.
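The abstract does not spell out the training objective, so the following is only a minimal sketch of the general idea of contrasting paired audio and visual signals. It uses a generic symmetric InfoNCE loss as a stand-in and is not the paper's label contrasting objective. The function names (`info_nce`, `cross_modal_contrastive_loss`), the temperature value, and the toy feature vectors are all hypothetical and chosen purely for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss, treating candidates[i] as the positive for anchors[i].

    Every other candidate in the batch serves as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def cross_modal_contrastive_loss(audio_feats, visual_feats, temperature=0.1):
    """Symmetric audio-to-visual and visual-to-audio InfoNCE (assumed form)."""
    audio = [l2_normalize(v) for v in audio_feats]
    visual = [l2_normalize(v) for v in visual_feats]
    a2v = info_nce(audio, visual, temperature)
    v2a = info_nce(visual, audio, temperature)
    return 0.5 * (a2v + v2a)


# Toy features (hypothetical): two segments, two-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each segment matches its own audio
visual_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairing deliberately broken

loss_aligned = cross_modal_contrastive_loss(audio, visual_aligned)
loss_swapped = cross_modal_contrastive_loss(audio, visual_swapped)
```

In this toy setting the loss is lower when each audio embedding is closest to the visual embedding of the same segment than when the pairing is swapped, which is the property a cross-modal self-supervision signal relies on.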
