Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Authors

  • Min Peng, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences
  • Chongyang Wang, Tsinghua University
  • Yu Shi, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences
  • Xiang-Dong Zhou, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v37i2.25296

Keywords:

CV: Language and Vision, CV: Scene Analysis & Understanding, CV: Video Understanding & Activity Analysis, ML: Multimodal Learning, SNLP: Question Answering

Abstract

This paper presents a new method for end-to-end Video Question Answering (VideoQA) that departs from the currently popular practice of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid. Code is available at: https://github.com/Trunpm/PMT-AAAI23.

Published

2023-06-26

How to Cite

Peng, M., Wang, C., Shi, Y., & Zhou, X.-D. (2023). Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2038-2046. https://doi.org/10.1609/aaai.v37i2.25296

Section

AAAI Technical Track on Computer Vision II