Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning
Abstract
The recent success of audio-visual representation learning can be largely attributed to the pervasive concurrency of the two modalities, which serves as a self-supervision signal from which correlation information can be extracted. While most recent works focus on capturing the shared associations between the audio and visual modalities, they rarely consider multiple audio-video pairs at once and pay little attention to exploiting the valuable information within each modality. To tackle this problem, we propose a novel audio-visual representation learning method, dubbed self-supervised curriculum learning (SSCL), which follows a teacher-student paradigm. Specifically, taking advantage of contrastive learning, a two-stage scheme transfers cross-modal information between the teacher and student models as a phased process. The proposed SSCL approach treats the pervasive concurrency of audio-visual data as latent supervision and mutually distills structural knowledge between the visual and audio modalities. Notably, SSCL learns discriminative audio and visual representations for various downstream applications. Extensive experiments on both video action recognition and audio sound recognition tasks show that SSCL markedly outperforms state-of-the-art self-supervised audio-visual representation learning methods.
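The cross-modal contrastive objective the abstract alludes to can be illustrated with a standard InfoNCE-style loss over a batch of paired audio and video embeddings, where matched pairs act as positives and all other pairings in the batch as negatives. The sketch below is a minimal NumPy illustration of that general idea, not the paper's actual SSCL implementation; all function names and the synthetic data are assumptions for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_infonce(audio_emb, video_emb, temperature=0.1):
    """InfoNCE contrastive loss over paired audio/video embeddings.

    Row i of audio_emb and row i of video_emb form a positive pair;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # NLL of the matched pairs

# Synthetic check: aligned pairs should score a lower loss than random pairs.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 128))
audio_aligned = video + 0.05 * rng.normal(size=(8, 128))  # concurrent audio
audio_random = rng.normal(size=(8, 128))                  # unrelated audio
loss_aligned = cross_modal_infonce(audio_aligned, video)
loss_random = cross_modal_infonce(audio_random, video)
```

Under this objective, temporally concurrent audio-video pairs are pulled together in the shared embedding space while mismatched pairs are pushed apart, which is the latent supervision signal the abstract refers to.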
How to Cite
Zhang, J., Xu, X., Shen, F., Lu, H., Liu, X., & Shen, H. T. (2021). Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4), 3351-3359. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16447
AAAI Technical Track on Computer Vision III