Audio-Visual Localization by Synthetic Acoustic Image Generation

Authors

  • Valentina Sanguineti (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy; University of Genova, Genoa, Italy)
  • Pietro Morerio (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy)
  • Alessio Del Bue (Visual Geometry and Modelling, Istituto Italiano di Tecnologia, Genoa, Italy)
  • Vittorio Murino (Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy; University of Verona, Verona, Italy; Huawei Technologies Ltd., Ireland Research Center, Dublin, Ireland)

Keywords

Multi-modal Vision, Scene Analysis & Understanding, Multimodal Learning, Unsupervised & Self-Supervised Learning

Abstract

Acoustic images constitute an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of sounds coming from different directions in space, thus providing richer information than that derived from mono and binaural microphones. However, acoustic images are typically generated by cumbersome microphone arrays, which are not as widespread as the ordinary microphones mounted on optical cameras. To exploit this richer modality while using only standard microphones and cameras, we propose to generate synthetic acoustic images from common audio-video data for the task of audio-visual localization. Synthetic acoustic images are generated by a novel deep architecture, based on Variational Autoencoder and U-Net models, which is trained to reconstruct the ground-truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal. In other words, the model learns to mimic what an array of microphones would produce under the same conditions. We assess the quality of the generated synthetic acoustic images on the task of unsupervised sound source localization, both qualitatively and quantitatively, and also report standard generation metrics. Our model is evaluated on multimodal datasets containing acoustic images, used for training, as well as on unseen datasets containing only monaural audio signals and RGB frames, achieving more accurate localization than the state of the art.
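To make the data layout concrete, the following toy sketch (not the paper's architecture) illustrates what a synthetic acoustic-image generator must output: a spatial grid where every cell carries its own spectral signature, conditioned on a video frame and a monaural spectrum. All shapes and the brightness-based "attention" stand-in are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative shapes (assumed, not from the paper):
H, W = 28, 28        # spatial grid of the acoustic image
F = 512              # frequency bins of the monaural spectrogram

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))   # RGB video frame
mono_spec = rng.random(F)           # monaural spectral energy

# Stand-in "generator": derive a spatial attention map from the frame
# (here simply block-averaged brightness) and use it to redistribute
# the mono spectrum over space, mimicking how a learned model would
# concentrate spectral energy at the sound source.
gray = frame.mean(axis=2)                              # (224, 224)
attention = gray.reshape(H, 8, W, 8).mean(axis=(1, 3)) # (28, 28)
attention /= attention.sum()                           # spatial distribution

# Each spatial cell gets a scaled copy of the mono spectrum.
acoustic_image = attention[:, :, None] * mono_spec[None, None, :]
assert acoustic_image.shape == (H, W, F)

# Because attention sums to 1, total spectral energy is preserved:
assert np.allclose(acoustic_image.sum(axis=(0, 1)), mono_spec)

# Unsupervised localization then reduces to finding where the
# per-cell acoustic energy concentrates:
energy = acoustic_image.sum(axis=2)
y, x = np.unravel_index(energy.argmax(), energy.shape)
print("estimated source cell:", (y, x))
```

In the actual model, the hand-crafted attention map above is replaced by a VAE/U-Net generator trained against ground-truth acoustic images from a microphone array; the sketch only fixes intuition about the tensor shapes and the localization readout.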

Published

2021-05-18

How to Cite

Sanguineti, V., Morerio, P., Del Bue, A., & Murino, V. (2021). Audio-Visual Localization by Synthetic Acoustic Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3), 2523-2531. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16354

Section

AAAI Technical Track on Computer Vision II