Audio-Visual Adaptive Fusion Network for Question Answering Based on Contrastive Learning

Xujian Zhao; Yixin Wang; Peiquan Jin

doi:10.1609/aaai.v39i10.33138

Authors

Xujian Zhao, Southwest University of Science and Technology
Yixin Wang, Southwest University of Science and Technology
Peiquan Jin, University of Science and Technology of China

Abstract

The Audio-Visual Question Answering (AVQA) task involves extracting question-related audio-visual clues from both temporal and spatial perspectives to answer questions accurately. Although existing multi-modal AVQA models perform well thanks to large-scale pre-trained models, two challenges remain. First, aligning audio-visual information across temporal and spatial dimensions is difficult. Second, the audio and visual modalities are often weighted inadequately during fusion, which limits model performance. To address these issues, we design the Audio-Visual Adaptive Fusion Network (AVAF-Net), which uses contrastive learning to align audio-visual information temporally and spatially and adaptively adjusts fusion weights based on the question. Specifically, we first align visual and audio information temporally through a temporal-alignment contrastive loss. An audio-visual clue-mining module then highlights question-related cues and aligns them spatially with the vocal region using a spatial-alignment contrastive loss. In addition, a question-oriented adaptive fusion module assigns different weights to the audio and visual modalities based on the question content and then fuses them. The fused audio-visual cues are finally used to predict the answer. Extensive experiments on the MUSIC-AVQA dataset show that AVAF-Net surpasses all baseline models, with a maximum improvement of 15.90% in average accuracy and an average improvement of 9.80%.
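
The two mechanisms named in the abstract, contrastive temporal alignment and question-conditioned fusion weights, can be illustrated with a minimal sketch. The abstract does not give the loss or the gating formula, so everything below is an assumption made for illustration: the loss is a generic InfoNCE-style objective in which audio segment t is the positive for visual segment t, the gate is a plain softmax over two question-derived scores, and the names `info_nce`, `adaptive_fusion`, `w_audio`, and `w_visual` are hypothetical rather than taken from AVAF-Net.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(audio, visual, temperature=0.1):
    """InfoNCE-style loss (assumed form, not the paper's exact loss).

    audio, visual: lists of per-segment feature vectors of equal length.
    Audio segment t is treated as the positive for visual segment t;
    every other visual segment acts as a negative.
    """
    a = [normalize(x) for x in audio]
    v = [normalize(x) for x in visual]
    loss = 0.0
    for t, a_t in enumerate(a):
        logits = [dot(a_t, v_s) / temperature for v_s in v]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[t]
    return loss / len(a)


def adaptive_fusion(question, audio_feat, visual_feat, w_audio, w_visual):
    """Question-conditioned weighted sum of two modality features.

    w_audio and w_visual are hypothetical projection vectors that map the
    question embedding to one score per modality; a softmax turns the two
    scores into fusion weights that sum to 1.
    """
    scores = [dot(question, w_audio), dot(question, w_visual)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha_a, alpha_v = exps[0] / total, exps[1] / total
    fused = [alpha_a * a + alpha_v * v for a, v in zip(audio_feat, visual_feat)]
    return fused, (alpha_a, alpha_v)


# Toy data: two segments whose audio and visual features already agree.
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[1.0, 0.0], [0.0, 1.0]]
shuffled_visual = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned_audio, aligned_visual)
loss_shuffled = info_nce(aligned_audio, shuffled_visual)

# A question embedding that, under the toy projections, favors audio.
fused, (alpha_a, alpha_v) = adaptive_fusion(
    question=[1.0, 0.0],
    audio_feat=[1.0, 1.0],
    visual_feat=[0.0, 0.0],
    w_audio=[2.0, 0.0],
    w_visual=[0.0, 0.0],
)
```

In this toy setup the loss is lower when the segments are temporally matched than when the visual segments are shuffled, and the fusion weights shift toward audio because the question scores that modality higher. A real model would learn the projections and operate on encoder outputs rather than hand-written vectors.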
