Video-Context Aligned Transformer for Video Question Answering

Authors

  • Linlin Zong Dalian University of Technology
  • Jiahui Wan Dalian University of Technology
  • Xianchao Zhang Dalian University of Technology
  • Xinyue Liu Dalian University of Technology
  • Wenxin Liang Dalian University of Technology
  • Bo Xu Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v38i17.29954

Keywords:

NLP: Question Answering

Abstract

Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that a video contains many instances and events beyond the scope of the stated question. This extreme imbalance between the information on the two sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are first encoded into a shared semantic space. We apply contrastive learning to the global video token and the context token to enhance semantic alignment. Then, the pooled context feature is used to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate V-CAT on the MSVD-QA and MSRVTT-QA datasets, achieving state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
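
The two alignment steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss or the retrieval mechanism, so the sketch assumes a symmetric InfoNCE-style contrastive loss for semantic alignment and a single-query dot-product attention for content alignment. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(video_tokens, context_tokens, temperature=0.07):
    # Assumed form of the contrastive objective: matched (video, context)
    # pairs in a batch are positives, all other pairings are negatives.
    v = [l2_normalize(x) for x in video_tokens]
    c = [l2_normalize(x) for x in context_tokens]
    logits = [[dot(vi, cj) / temperature for cj in c] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

def context_guided_pool(frame_features, context_feature):
    # Assumed form of content alignment: the pooled context feature acts as
    # a query that softly selects the frame features it matches best.
    d = len(context_feature)
    scores = [dot(f, context_feature) / math.sqrt(d) for f in frame_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * f[k] for w, f in zip(weights, frame_features)) for k in range(d)]
    return pooled, weights

# Toy batch of two samples with 2-dimensional global tokens.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])

# Three toy frame features queried by one pooled context feature.
pooled, weights = context_guided_pool([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [1.0, 0.0])
```

In this toy setup the loss is lower when each video token is paired with its own context token than when the pairs are swapped, and the frame most similar to the context feature receives the largest attention weight.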

Published

2024-03-24

How to Cite

Zong, L., Wan, J., Zhang, X., Liu, X., Liang, W., & Xu, B. (2024). Video-Context Aligned Transformer for Video Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19795–19803. https://doi.org/10.1609/aaai.v38i17.29954

Issue

Section

AAAI Technical Track on Natural Language Processing II