Audio-Visual Localization by Synthetic Acoustic Image Generation

Authors

  • Valentina Sanguineti (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy; University of Genova, Genoa, Italy)
  • Pietro Morerio (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy)
  • Alessio Del Bue (Visual Geometry and Modelling, Istituto Italiano di Tecnologia, Genoa, Italy)
  • Vittorio Murino (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy; University of Verona, Verona, Italy; Huawei Technologies Ltd., Ireland Research Center, Dublin, Ireland)

Keywords

Multi-modal Vision, Scene Analysis & Understanding, Multimodal Learning, Unsupervised & Self-Supervised Learning

Abstract

Acoustic images constitute an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of sounds coming from different directions in space, thus providing richer information than that derived from mono and binaural microphones. However, acoustic images are typically generated by cumbersome microphone arrays, which are not as widespread as the ordinary microphones mounted on optical cameras. To exploit this richer modality while using only standard microphones and cameras, we propose to generate synthetic acoustic images from common audio-video data for the task of audio-visual localization. Synthetic acoustic images are generated by a novel deep architecture, based on Variational Autoencoder and U-Net models, which is trained to reconstruct the ground-truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal. In other words, the model learns to mimic what an array of microphones would produce under the same conditions. We assess the quality of the generated synthetic acoustic images on the task of unsupervised sound source localization, both qualitatively and quantitatively, and also report standard generation metrics. Our model is evaluated on multimodal datasets containing acoustic images, used for training, as well as on unseen datasets containing only monaural audio signals and RGB frames, achieving more accurate localization than the state of the art.
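To make the data layout concrete, the following toy sketch (not the paper's architecture) illustrates what a synthetic acoustic-image generator must output: a spatial grid where every cell carries its own spectral signature, conditioned on a video frame and a monaural spectrum. All shapes and the brightness-based "attention" stand-in are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative shapes (assumed, not from the paper):
H, W = 28, 28        # spatial grid of the acoustic image
F = 512              # frequency bins of the monaural spectrogram

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))   # RGB video frame
mono_spec = rng.random(F)           # monaural spectral energy

# Stand-in "generator": derive a spatial attention map from the frame
# (here simply block-averaged brightness) and use it to redistribute
# the mono spectrum over space, mimicking how a learned model would
# concentrate spectral energy at the sound source.
gray = frame.mean(axis=2)                              # (224, 224)
attention = gray.reshape(H, 8, W, 8).mean(axis=(1, 3)) # (28, 28)
attention /= attention.sum()                           # spatial distribution

# Each spatial cell gets a scaled copy of the mono spectrum.
acoustic_image = attention[:, :, None] * mono_spec[None, None, :]
assert acoustic_image.shape == (H, W, F)

# Because attention sums to 1, total spectral energy is preserved:
assert np.allclose(acoustic_image.sum(axis=(0, 1)), mono_spec)

# Unsupervised localization then reduces to finding where the
# per-cell acoustic energy concentrates:
energy = acoustic_image.sum(axis=2)
y, x = np.unravel_index(energy.argmax(), energy.shape)
print("estimated source cell:", (y, x))
```

In the actual model, the hand-crafted attention map above is replaced by a VAE/U-Net generator trained against ground-truth acoustic images from a microphone array; the sketch only fixes intuition about the tensor shapes and the localization readout.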

Published

2021-05-18

How to Cite

Sanguineti, V., Morerio, P., Del Bue, A., & Murino, V. (2021). Audio-Visual Localization by Synthetic Acoustic Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3), 2523-2531. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16354

Section

AAAI Technical Track on Computer Vision II