Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Min Peng; Chongyang Wang; Yu Shi; Xiang-Dong Zhou

doi:10.1609/aaai.v37i2.25296

Authors

Min Peng, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences
Chongyang Wang, Tsinghua University
Yu Shi, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences
Xiang-Dong Zhou, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v37i2.25296

Keywords:

CV: Language and Vision, CV: Scene Analysis & Understanding, CV: Video Understanding & Activity Analysis, ML: Multimodal Learning, SNLP: Question Answering

Abstract

This paper presents a new method for end-to-end Video Question Answering (VideoQA) that departs from the currently popular practice of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply combines a learnable word embedding layer with a few convolutional and transformer layers. We use an anisotropic pyramid to realize video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also demonstrates the effectiveness of the pyramid. Code is available at: https://github.com/Trunpm/PMT-AAAI23.
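
The "canonical pyramid" mentioned in the abstract (a bottom-up pathway, a top-down pathway, and lateral connections) can be illustrated with a minimal sketch in plain Python. This is not the authors' PMT: it works on the temporal axis only, uses average pooling and nearest-neighbour upsampling as stand-ins for learned convolutional and transformer layers, and omits the spatial/temporal decomposition and the interaction with language. All function names here are hypothetical; the actual implementation is in the repository linked above.

```python
from typing import List

Frames = List[List[float]]  # T frames, each a D-dimensional feature vector


def avg_pool_pairs(seq: Frames) -> Frames:
    """Bottom-up step: halve the temporal length by averaging adjacent frames.

    A trailing frame is dropped when the length is odd.
    """
    return [
        [(a + b) / 2.0 for a, b in zip(seq[i], seq[i + 1])]
        for i in range(0, len(seq) - 1, 2)
    ]


def upsample_nearest(seq: Frames, target_len: int) -> Frames:
    """Top-down step: stretch a coarse sequence to target_len by repeating frames."""
    n = len(seq)
    return [seq[min(i * n // target_len, n - 1)] for i in range(target_len)]


def add_frames(a: Frames, b: Frames) -> Frames:
    """Lateral connection: element-wise sum of two equally long sequences."""
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def build_pyramid(frames: Frames, levels: int) -> List[Frames]:
    """Return fused features from finest (index 0) to coarsest scale."""
    if levels < 1 or len(frames) < 2 ** (levels - 1):
        raise ValueError("not enough frames for the requested number of levels")

    # Bottom-up pathway: progressively coarser temporal scales.
    bottom_up = [frames]
    for _ in range(levels - 1):
        bottom_up.append(avg_pool_pairs(bottom_up[-1]))

    # Top-down pathway: start at the coarsest scale, upsample, and fuse
    # with the same-scale bottom-up features via a lateral connection.
    fused = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        upsampled = upsample_nearest(fused[0], len(lateral))
        fused.insert(0, add_frames(lateral, upsampled))
    return fused


pyramid = build_pyramid([[0.0], [2.0], [4.0], [6.0]], levels=3)
print([len(level) for level in pyramid])  # temporal length at each scale
```

In this toy setup each finer level receives context from all coarser levels, which is the general idea behind combining local and global information across scales; how PMT realizes this with learned layers and language features is described in the paper itself.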
