Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
DOI
https://doi.org/10.1609/aaai.v38i4.28116
Keywords
CV: Scene Analysis & Understanding, CV: Language and Vision, CV: Video Understanding & Activity Analysis, NLP: Question Answering
Abstract
This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and to explore the multi-modal relations (i.e., among the object, audio, and question modalities) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on the respective keywords of the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects highly semantically matched multi-modal pairs as positives. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selection process is adaptive, as the positive pairs selected may differ across video frames. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.
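As a rough illustration of the adaptive-positivity idea in the abstract, the sketch below is our own minimal interpretation, not the paper's implementation: the function name `adaptive_positivity_loss` and the parameters `top_k` and `tau` are hypothetical. It adaptively selects the objects most similar to an anchor (a question or audio embedding) as positives, then applies an InfoNCE-style contrastive loss so that the selected pairs score higher than the mismatched ones.

```python
import numpy as np

def adaptive_positivity_loss(anchor, objects, top_k=2, tau=0.1):
    """Sketch of an adaptive-positivity contrastive loss (assumed form).

    anchor:  (d,) question or audio embedding.
    objects: (n, d) object embeddings from one video frame.
    The top_k objects most similar to the anchor are treated as positives;
    the loss pushes their similarity above that of the remaining objects.
    """
    # cosine similarity between the anchor and each object
    a = anchor / np.linalg.norm(anchor)
    o = objects / np.linalg.norm(objects, axis=1, keepdims=True)
    sims = o @ a                                  # (n,)
    pos_idx = np.argsort(sims)[-top_k:]           # adaptively chosen positives
    logits = sims / tau
    log_denom = np.log(np.exp(logits).sum())      # all pairs in the denominator
    # InfoNCE-style: -log(exp(pos) / sum(exp(all))), averaged over positives
    return float(np.mean(log_denom - logits[pos_idx]))
```

Because the positives are re-selected per anchor from the similarity ranking, the set of positive question-object or audio-object pairs can differ from frame to frame, which is the "adaptive" aspect described above.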
Published
2024-03-24
How to Cite
Li, Z., Guo, D., Zhou, J., Zhang, J., & Wang, M. (2024). Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3306-3314. https://doi.org/10.1609/aaai.v38i4.28116
Section
AAAI Technical Track on Computer Vision III