AVSegFormer: Audio-Visual Segmentation with Transformer

Authors

  • Shengyi Gao, Nanjing University
  • Zhe Chen, Nanjing University
  • Guo Chen, Nanjing University
  • Wenhai Wang, The Chinese University of Hong Kong
  • Tong Lu, Nanjing University

DOI:

https://doi.org/10.1609/aaai.v38i11.29104

Keywords:

ML: Multimodal Learning, ML: Deep Learning Algorithms, ML: Transfer, Domain Adaptation, Multi-Task Learning

Abstract

Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. Existing methods cannot dynamically process the fine-grained correlations between audio and visual cues across various situations, and they struggle to adapt to complex scenarios such as evolving audio and the coexistence of multiple objects. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, it comprises a dense audio-visual mixer, which dynamically adjusts the visual features of interest, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches them with the optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving segmentation performance across different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
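The abstract describes the two components only at a high level. As a rough illustration (not the authors' released implementation, which is available at the GitHub link above), the sketch below shows how a dense audio-visual mixer and a sparse query-based decoder could be wired together in PyTorch. All class names (DenseAVMixer, SparseAVDecoder), hyperparameters, and tensor layouts here are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class DenseAVMixer(nn.Module):
    """Hypothetical sketch: visual tokens cross-attend to audio tokens,
    producing an audio-conditioned residual that re-weights the visual
    features of interest (the 'dense audio-visual mixer' idea)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, HW, C) flattened feature map; audio: (B, T, C) audio tokens
        mixed, _ = self.cross_attn(query=visual, key=audio, value=audio)
        return self.norm(visual + mixed)


class SparseAVDecoder(nn.Module):
    """Hypothetical sketch: learnable queries concatenated with audio tokens
    decode against the mixed visual features; dot products between decoded
    queries and visual tokens give per-query mask logits."""
    def __init__(self, dim=256, heads=8, num_queries=100, layers=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)

    def forward(self, visual, audio):
        B = visual.size(0)
        q = torch.cat([self.queries.unsqueeze(0).expand(B, -1, -1), audio], dim=1)
        q = self.decoder(q, memory=visual)                # (B, Nq + T, C)
        masks = torch.einsum("bqc,bpc->bqp", q, visual)   # (B, Nq + T, HW) mask logits
        return masks


# Toy usage with random tensors (shapes are assumptions)
if __name__ == "__main__":
    B, HW, T, C = 2, 64, 5, 256
    vis, aud = torch.randn(B, HW, C), torch.randn(B, T, C)
    mixer, decoder = DenseAVMixer(C), SparseAVDecoder(C)
    print(decoder(mixer(vis, aud), aud).shape)  # torch.Size([2, 105, 64])
```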

Published

2024-03-24

How to Cite

Gao, S., Chen, Z., Chen, G., Wang, W., & Lu, T. (2024). AVSegFormer: Audio-Visual Segmentation with Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12155-12163. https://doi.org/10.1609/aaai.v38i11.29104

Section

AAAI Technical Track on Machine Learning II