SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation


  • Dongzhan Zhou, The University of Sydney
  • Xinchi Zhou, The University of Sydney
  • Di Hu, Renmin University of China
  • Hang Zhou, Baidu Inc.
  • Lei Bai, The University of Sydney
  • Ziwei Liu, Nanyang Technological University
  • Wanli Ouyang, The University of Sydney



Computer Vision (CV)


Multiple modalities can provide rich semantic information, and exploiting this information normally leads to better performance than a single-modality counterpart. However, it is not easy to devise an effective cross-modal fusion structure because feature dimensions and semantics vary across modalities, especially when the inputs come from different sensors, as in audio-visual learning. In this work, we propose SepFusion, a novel framework that produces optimal fusion structures for visual sound separation. The framework is composed of two components: the model generator and the evaluator. To construct the generator, we devise a lightweight architecture space that can adapt to different input modalities, so that audio-visual fusion structures can be obtained easily on demand. For the evaluator, we adopt the idea of neural architecture search to select superior networks effectively. This automatic process substantially reduces human effort while achieving competitive performance. Moreover, since SepFusion provides a series of strong models, the model family can be used for broader applications, such as further improving performance via model assembly or providing suitable architectures for separating certain instrument classes. These potential applications further enhance the competitiveness of our approach.
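As a rough illustration of the generator/evaluator split described in the abstract, the following minimal sketch samples candidate fusion structures from a small architecture space and keeps the top-scoring ones as a model family. It is not the authors' implementation: the search space (`FUSION_OPS`, `FUSION_STAGES`, `CHANNELS`), the `FusionArch` type, and especially the scoring function in `evaluate` are hypothetical placeholders, and plain random search stands in for whatever neural architecture search strategy the paper actually uses.

```python
import random
from dataclasses import dataclass

# Hypothetical architecture space; these choices are illustrative only
# and are not taken from the paper.
FUSION_OPS = ("concat", "sum", "attention")
FUSION_STAGES = (1, 2, 3)
CHANNELS = (16, 32, 64)


@dataclass(frozen=True)
class FusionArch:
    """One candidate audio-visual fusion structure (hypothetical encoding)."""
    op: str
    stage: int
    channels: int


def generate(rng):
    """Generator: sample one candidate from the architecture space."""
    return FusionArch(
        op=rng.choice(FUSION_OPS),
        stage=rng.choice(FUSION_STAGES),
        channels=rng.choice(CHANNELS),
    )


def evaluate(arch):
    """Evaluator stand-in: a made-up proxy score, higher is better.

    A real evaluator would estimate separation quality on validation
    data; this formula exists only so the sketch runs end to end.
    """
    op_bonus = {"concat": 0.0, "sum": 0.1, "attention": 0.3}[arch.op]
    return op_bonus + 0.05 * arch.stage + arch.channels / 640.0


def search(num_candidates=50, top_k=3, seed=0):
    """Sample candidates, deduplicate them, and return the top_k by score."""
    rng = random.Random(seed)
    candidates = {generate(rng) for _ in range(num_candidates)}
    return sorted(candidates, key=evaluate, reverse=True)[:top_k]


# The retained top-k structures play the role of the "model family".
family = search()
for arch in family:
    print(arch, round(evaluate(arch), 3))
```

Keeping several top candidates rather than a single winner mirrors the abstract's point that a family of strong models can be reused, for example by combining them or by choosing a different member for a particular instrument class.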




How to Cite

Zhou, D., Zhou, X., Hu, D., Zhou, H., Bai, L., Liu, Z., & Ouyang, W. (2022). SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 3544-3552.



AAAI Technical Track on Computer Vision III