Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

Authors

  • Xian Liu, The Chinese University of Hong Kong; Zhejiang University
  • Rui Qian, The Chinese University of Hong Kong; Shanghai Jiao Tong University
  • Hang Zhou, The Chinese University of Hong Kong
  • Di Hu, Renmin University of China
  • Weiyao Lin, Shanghai Jiao Tong University
  • Ziwei Liu, Nanyang Technological University
  • Bolei Zhou, The Chinese University of Hong Kong
  • Xiaowei Zhou, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v36i2.20073

Keywords:

Computer Vision (CV), Machine Learning (ML)

Abstract

The task of audio-visual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. In real-world scenarios, however, audio is usually contaminated by off-screen sound and background noise. Such contamination interferes with identifying the desired sources and building visual-sound connections, making previous approaches inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio-Instance-Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. We then erase the influence of audible but off-screen sounds and silent but visible objects with a Cross-modal Referrer module using cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially in real-world scenarios.
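
The abstract's remark about the additive nature of audio can be illustrated with a toy example. The sketch below is not the IEr framework or any part of the authors' code; it only mixes two synthetic sine tones at uneven volumes and shows that a single global descriptor (total energy) is dominated by the louder source, while a per-source measurement still recovers the quiet one. All function names, frequencies, and gains are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for this toy example
N_SAMPLES = 800     # 0.1 s, so both tones below fit a whole number of cycles


def tone(freq_hz):
    """Return a unit-amplitude sine tone as a list of floats."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(N_SAMPLES)]


def mix(sources, gains):
    """Additively mix equal-length sources, scaling each by its gain."""
    return [sum(g * s[i] for s, g in zip(sources, gains))
            for i in range(N_SAMPLES)]


def energy(signal):
    """Mean squared amplitude of the whole signal (one global descriptor)."""
    return sum(x * x for x in signal) / len(signal)


def tone_energy(signal, freq_hz):
    """Energy of a single frequency component, via sine/cosine projection."""
    n = len(signal)
    s = sum(x * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t, x in enumerate(signal))
    c = sum(x * math.cos(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t, x in enumerate(signal))
    amplitude = 2 * math.sqrt(s * s + c * c) / n
    return amplitude * amplitude / 2


# A loud source (gain 1.0) and a quiet one (gain 0.1), mixed by plain addition.
loud, quiet = tone(440), tone(1000)
mixture = mix([loud, quiet], gains=[1.0, 0.1])

total = energy(mixture)
quiet_share = tone_energy(mixture, 1000) / total
print(f"total energy: {total:.4f}")
print(f"share contributed by the quiet source: {quiet_share:.4f}")
```

In this setup the quiet tone accounts for roughly 1% of the total energy, so a single pooled representation would barely register it, which is the kind of uneven mixing the abstract says motivates distinguishing sounding instances separately.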

Published

2022-06-28

How to Cite

Liu, X., Qian, R., Zhou, H., Hu, D., Lin, W., Liu, Z., Zhou, B., & Zhou, X. (2022). Visual Sound Localization in the Wild by Cross-Modal Interference Erasing. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), 1801-1809. https://doi.org/10.1609/aaai.v36i2.20073

Section

AAAI Technical Track on Computer Vision II