Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Authors

  • Zhangbin Li, School of Computer Science and Information Engineering, Hefei University of Technology
  • Dan Guo, School of Computer Science and Information Engineering, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Anhui Zhonghuitong Technology Co., Ltd
  • Jinxing Zhou, School of Computer Science and Information Engineering, Hefei University of Technology
  • Jing Zhang, School of Computer Science and Information Engineering, Hefei University of Technology
  • Meng Wang, School of Computer Science and Information Engineering, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

DOI:

https://doi.org/10.1609/aaai.v38i4.28116

Keywords:

CV: Scene Analysis & Understanding, CV: Language and Vision, CV: Video Understanding & Activity Analysis, NLP: Question Answering

Abstract

This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions about untrimmed audible videos. To generate accurate answers, an AVQA model must find the audio-visual clues most relevant to the given question. We propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and to explore the multi-modal relations (i.e., among the objects, audio, and question) in terms of both feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that uses a question-conditioned clue discovery module to focus the audio and visual modalities on the respective keywords of the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects highly semantic-matched multi-modal pairs as positives. Specifically, we design two object-aware contrastive loss functions that identify highly relevant question-object pairs and audio-object pairs, respectively; the selected pairs are constrained to have larger similarity values than mismatched pairs. The positivity selection is adaptive, as the positive pairs selected may differ from frame to frame. Together, these two object-aware objectives help the model understand which objects are relevant to the question and which objects are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective at finding informative audio-visual clues and achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.
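As a concrete illustration of the adaptive-positivity idea described above, the following PyTorch-style sketch shows how such an object-aware contrastive loss might look. It is a minimal sketch, not the authors' implementation: the function name, the top-k positivity rule, and the InfoNCE-style formulation are assumptions made here for illustration; the paper's actual method is in the repository linked above.

```python
import torch
import torch.nn.functional as F

def adaptive_positivity_loss(anchor, objects, top_k=3, temperature=0.1):
    """Object-aware adaptive-positivity contrastive loss (illustrative sketch).

    anchor:  (B, D)    question or audio embedding for each sample
    objects: (B, N, D) embeddings of N detected objects per video frame

    For every anchor, the top_k most similar objects are adaptively
    selected as positives; the remaining objects serve as negatives,
    so selected pairs are pushed to score higher than mismatched pairs.
    """
    anchor = F.normalize(anchor, dim=-1)
    objects = F.normalize(objects, dim=-1)

    # Cosine similarity between the anchor and every object: (B, N)
    sim = torch.einsum('bd,bnd->bn', anchor, objects) / temperature

    # Adaptive positivity: the selected objects may differ per sample/frame.
    pos_idx = sim.topk(top_k, dim=-1).indices                 # (B, top_k)
    pos_mask = torch.zeros_like(sim).scatter_(1, pos_idx, 1.0)

    # InfoNCE-style objective over objects: maximize the probability
    # mass assigned to the adaptively selected positive objects.
    log_prob = sim.log_softmax(dim=-1)
    loss = -(pos_mask * log_prob).sum(dim=-1) / top_k
    return loss.mean()

# Hypothetical usage: one instance pairs question features with object
# features, another pairs audio features with object features, and both
# are added to the standard question-answering loss.
# total_loss = qa_loss \
#     + 0.5 * adaptive_positivity_loss(question_feat, object_feat) \
#     + 0.5 * adaptive_positivity_loss(audio_feat, object_feat)
```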

Published

2024-03-24

How to Cite

Li, Z., Guo, D., Zhou, J., Zhang, J., & Wang, M. (2024). Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3306-3314. https://doi.org/10.1609/aaai.v38i4.28116

Issue

Vol. 38 No. 4 (2024)

Section

AAAI Technical Track on Computer Vision III