Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning
DOI:
https://doi.org/10.1609/aaai.v39i10.33138
Abstract
The Audio-Visual Question Answering (AVQA) task requires extracting question-related audio-visual clues from both temporal and spatial perspectives to answer questions accurately. Existing multi-modal AVQA models perform well, largely thanks to large-scale pre-trained models, but two challenges remain. First, aligning audio-visual information across temporal and spatial dimensions is difficult. Second, audio and visual information is often weighted inadequately during fusion, which limits model performance. To address these issues, we design the Audio-Visual Adaptive Fusion Network (AVAF-Net), which uses contrastive learning to align audio-visual information temporally and spatially and adaptively adjusts fusion weights based on the question. Specifically, we first align visual and audio information temporally through a temporal-alignment contrastive loss. An audio-visual clue-mining module then highlights question-related cues and aligns them spatially with the vocal region using a spatial-alignment contrastive loss. Additionally, a question-oriented adaptive fusion module assigns different weights to the audio and visual modalities based on the question content and then fuses them. The fused audio-visual cues are finally used to predict the answer. Extensive experiments on the MUSIC-AVQA dataset show that AVAF-Net surpasses all baseline models, with a maximum improvement of 15.90% in average accuracy and an average improvement of 9.80%.
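The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the loss formula or the fusion architecture, so the sketch assumes a symmetric InfoNCE-style loss for temporal alignment (audio and visual features from the same time segment are treated as positives) and a two-way softmax gate computed from a question embedding. The function names, the temperature value, and the gate matrix `w_gate` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def temporal_contrastive_loss(audio, visual, temperature=0.1):
    """Symmetric InfoNCE-style loss over T time segments (assumed form).

    audio, visual: arrays of shape (T, D). Row t of each array is assumed to
    describe the same time segment, so (audio[t], visual[t]) is the positive
    pair and every other pairing is a negative.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature  # (T, T) similarity matrix

    def diagonal_cross_entropy(z):
        # Row-wise log-softmax, then pick the log-probability of the positive.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def question_adaptive_fusion(audio_feat, visual_feat, question, w_gate):
    """Fuse two modality vectors with weights predicted from the question.

    question: question embedding of shape (Dq,).
    w_gate:   hypothetical learned matrix of shape (Dq, 2) that scores the
              audio and visual modalities; a softmax turns scores into weights.
    """
    scores = question @ w_gate
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights[0] * audio_feat + weights[1] * visual_feat
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(8, 16))
    audio = visual + 0.01 * rng.normal(size=(8, 16))  # nearly aligned features
    print("aligned loss:   ", temporal_contrastive_loss(audio, visual))
    print("misaligned loss:", temporal_contrastive_loss(audio[::-1], visual))

    question = rng.normal(size=4)
    w_gate = rng.normal(size=(4, 2))
    fused, weights = question_adaptive_fusion(audio[0], visual[0], question, w_gate)
    print("modality weights (audio, visual):", weights)
```

Under these assumptions, temporally matched features yield a lower loss than the same features in reversed order, and the gate always produces two non-negative weights that sum to one, so the question alone decides how much each modality contributes to the fused cue.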
Published
2025-04-11
How to Cite
Zhao, X., Wang, Y., & Jin, P. (2025). Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10483-10491. https://doi.org/10.1609/aaai.v39i10.33138
Section
AAAI Technical Track on Computer Vision IX