Video-Context Aligned Transformer for Video Question Answering

Linlin Zong; Jiahui Wan; Xianchao Zhang; Xinyue Liu; Wenxin Liang; Bo Xu

doi:10.1609/aaai.v38i17.29954

Authors

Linlin Zong, Dalian University of Technology
Jiahui Wan, Dalian University of Technology
Xianchao Zhang, Dalian University of Technology
Xinyue Liu, Dalian University of Technology
Wenxin Liang, Dalian University of Technology
Bo Xu, Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v38i17.29954

Keywords:

NLP: Question Answering

Abstract

Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they overlook the fact that a video contains far more instances and events than the question refers to. This extreme imbalance between the information on the two sides makes reasoning significantly unstable. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are first encoded into a shared semantic space. We then apply contrastive learning to the global video token and the context token to enhance semantic alignment. Next, the pooled context feature is used to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate V-CAT on the MSVD-QA and MSRVTT-QA datasets and achieve state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
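The two alignment steps named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the loss form, the temperature, or the attention mechanism, so the symmetric InfoNCE loss, the value `temperature=0.07`, the single-query scaled dot-product attention, and all function names below are assumptions chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, idx):
    """Numerically stable log-softmax, evaluated at a single index."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[idx] - lse


def symmetric_info_nce(video_tokens, context_tokens, temperature=0.07):
    """Assumed loss form: pair i of video/context is the positive,
    every other item in the batch is a negative, in both directions."""
    v = [l2_normalize(x) for x in video_tokens]
    c = [l2_normalize(x) for x in context_tokens]
    n = len(v)
    sim = [[dot(v[i], c[j]) / temperature for j in range(n)] for i in range(n)]
    # video -> context: each row should peak on its diagonal entry
    v2c = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # context -> video: each column should peak on its diagonal entry
    c2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2c + c2v)


def context_guided_pooling(context_vec, frame_feats):
    """Assumed content-alignment step: the pooled context feature acts as a
    single attention query over per-frame features."""
    scale = math.sqrt(len(context_vec))
    scores = [dot(context_vec, f) / scale for f in frame_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_feats[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, frame_feats)) for k in range(dim)]
    return weights, pooled


# Toy batch: two video tokens and their matching context tokens.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
weights, pooled = context_guided_pooling([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy setup the loss is lower when each video token is paired with its own context token than when the pairs are shuffled, and the attention weights favor the frame whose feature points in the same direction as the context query. A real model would compute these quantities on learned encoder outputs with a tensor library rather than on hand-written lists.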
