Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning

Authors

  • Jingran Zhang, Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China
  • Xing Xu, Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China
  • Fumin Shen, Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China
  • Huimin Lu, Kyushu Institute of Technology
  • Xin Liu, Huaqiao University
  • Heng Tao Shen, Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v35i4.16447

Keywords:

Applications

Abstract

The recent success of audio-visual representation learning can be largely attributed to the pervasive concurrency of audio and visual signals, which can serve as a self-supervision signal for extracting correlation information. While most recent works focus on capturing the shared associations between the audio and visual modalities, they rarely consider multiple audio and video pairs at once and pay little attention to exploiting the valuable information within each modality. To tackle this problem, we propose a novel audio-visual representation learning method, dubbed self-supervised curriculum learning (SSCL), built on a teacher-student learning framework. Specifically, taking advantage of contrastive learning, we adopt a two-stage scheme that transfers cross-modal information between the teacher and student models as a phased process. The proposed SSCL approach regards the pervasive audio-visual concurrency as latent supervision and mutually distills the structural knowledge of visual to audio data. Notably, the SSCL method can learn discriminative audio and visual representations for various downstream applications. Extensive experiments on both video action recognition and audio sound recognition tasks show that the SSCL method remarkably improves performance over state-of-the-art self-supervised audio-visual representation learning methods.
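
To make the idea of contrasting multiple audio and video pairs at once concrete, the following is a minimal sketch of a generic InfoNCE-style cross-modal contrastive loss. It is not the SSCL objective from the paper: the function names (`l2_normalize`, `info_nce`), the temperature value, and the toy two-dimensional embeddings are all assumptions chosen for illustration. Each audio embedding is pulled toward the video embedding from the same clip and pushed away from the video embeddings of the other clips in the batch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(audio, video, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired embeddings.

    audio[i] and video[i] are assumed to come from the same clip (the
    positive pair); every video[j] with j != i acts as a negative.
    The temperature of 0.1 is an illustrative choice, not a value
    taken from the paper.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in video]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of audio i to every video in the batch.
        logits = [sum(p * q for p, q in zip(a_i, v_j)) / temperature for v_j in v]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true partner.
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch: two clips with hand-made 2-D embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]   # partners are similar
video_swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners are dissimilar

loss_aligned = info_nce(audio, video_aligned)
loss_swapped = info_nce(audio, video_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the loss is lower when each audio embedding lies close to its own video embedding than when the pairing is swapped, which is the behavior a concurrency-based self-supervision signal relies on. A teacher-student variant, as described in the abstract, would presumably hold one modality's encoder fixed while training the other, but that detail is not specified here.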

Published

2021-05-18

How to Cite

Zhang, J., Xu, X., Shen, F., Lu, H., Liu, X., & Shen, H. T. (2021). Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4), 3351-3359. https://doi.org/10.1609/aaai.v35i4.16447

Section

AAAI Technical Track on Computer Vision III