AV-SSAN: Audio-Visual Selective DOA Estimation Through Explicit Multi-Band Semantic-Spatial Alignment

Authors

  • Yu Chen, University of Science and Technology Beijing (USTB), China; School of Data Science, The Chinese University of Hong Kong (Shenzhen), China
  • Hongxu Zhu, Fano, Hong Kong
  • Jiadong Wang, Technical University of Munich, Germany
  • Kainan Chen, Eigenspace GmbH, Germany
  • Xinyuan Qian, University of Science and Technology Beijing (USTB), China

DOI:

https://doi.org/10.1609/aaai.v40i25.39175

Abstract

Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methods typically require spatially paired audio-visual data and cannot selectively localize specific target sources. To address these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that localizes target sound sources using visual prompts from different instances of the same semantic class. CI-AVL enables selective localization without spatially paired data. To solve this task, we propose AV-SSAN, a semantic-spatial alignment framework centered on a Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net). MB-SSA Net decomposes the audio spectrogram into multiple frequency bands, aligns each band with semantic visual prompts, and refines spatial cues to estimate the direction-of-arrival (DoA). To facilitate this research, we construct VGGSound-SSL, a large-scale dataset comprising 13,981 spatial audio clips across 296 categories, each paired with visual prompts. AV-SSAN achieves a mean absolute error of 16.59° and an accuracy of 71.29%, significantly outperforming existing AV-SSL methods.
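
As a rough illustration of the two reported metrics, the sketch below computes a mean absolute angular error and a tolerance-based accuracy for DoA predictions. It is not the paper's evaluation protocol: the abstract does not state the accuracy threshold or whether angles wrap around 360°, so the `tolerance_deg` parameter, its default of 15°, the wraparound handling, and the function names `angular_error` and `doa_metrics` are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def angular_error(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180).

    Assumes angles live on a full 360-degree circle, so 350 and 10 are 20 apart.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def doa_metrics(
    pred_deg: Sequence[float],
    true_deg: Sequence[float],
    tolerance_deg: float = 15.0,  # hypothetical threshold, not taken from the paper
) -> Tuple[float, float]:
    """Return (mean absolute error in degrees, accuracy in percent).

    A prediction counts as correct when its angular error is within
    `tolerance_deg` of the ground truth.
    """
    if len(pred_deg) != len(true_deg) or len(pred_deg) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [angular_error(p, t) for p, t in zip(pred_deg, true_deg)]
    mae = sum(errors) / len(errors)
    accuracy = 100.0 * sum(e <= tolerance_deg for e in errors) / len(errors)
    return mae, accuracy


# Toy data only; these values are invented for the example.
mae, accuracy = doa_metrics([10.0, 350.0, 180.0, 90.0], [20.0, 10.0, 170.0, 30.0])
print(f"MAE = {mae:.2f} deg, accuracy = {accuracy:.2f}%")
```

On the toy inputs the per-sample errors are 10°, 20°, 10°, and 60°, so the wraparound case (350° versus 10°) is scored as 20° rather than 340°.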

Published

2026-03-14

How to Cite

Chen, Y., Zhu, H., Wang, J., Chen, K., & Qian, X. (2026). AV-SSAN: Audio-Visual Selective DOA Estimation Through Explicit Multi-Band Semantic-Spatial Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20409–20417. https://doi.org/10.1609/aaai.v40i25.39175

Section

AAAI Technical Track on Machine Learning II