Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang; Yahong Han

doi:10.1609/aaai.v34i07.6767

Authors

Pin Jiang, Tianjin University
Yahong Han, Tianjin University

DOI: https://doi.org/10.1609/aaai.v34i07.6767

Abstract

The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process the video and the question separately, then feed the representations of the different modalities into subsequent late-fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate both inter- and intra-modality correlations in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and can interact simultaneously over the heterogeneous graph, and is then used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study to verify the effectiveness of the network architecture. Experiments show the network to be superior in quality.
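The idea of aligning inter- and intra-modality information over a single graph can be illustrated with a minimal sketch. This is not the network described in the paper: the abstract gives no implementation details, so the function name `align_step`, the dot-product similarity, the softmax normalization, and the toy two-dimensional features are all assumptions made purely for illustration. The sketch treats video shots and question words as nodes of one graph, so that every node aggregates messages from nodes of both modalities in the same step.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_step(shot_feats, word_feats):
    """One hypothetical alignment step over a joint graph of shots and words.

    Assumption: both modalities are already projected to the same dimension.
    """
    # Heterogeneous node set: video shots first, then question words.
    nodes = shot_feats + word_feats
    # Row i holds normalized similarities from node i to every node,
    # covering intra-modality and inter-modality edges alike.
    adjacency = [softmax([dot(u, v) for v in nodes]) for u in nodes]
    # Each node is updated with the weighted sum of all node features.
    dim = len(nodes[0])
    updated = [
        [sum(w * v[k] for w, v in zip(row, nodes)) for k in range(dim)]
        for row in adjacency
    ]
    return adjacency, updated


# Toy input: two video shots and one question word (illustrative values).
shots = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 1.0]]
adjacency, updated = align_step(shots, words)
```

Because the adjacency matrix spans both node types, a single aggregation step mixes shot-to-shot, word-to-word, and shot-to-word information, which is the property the abstract attributes to a uniform module.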
