Multimodal Keyless Attention Fusion for Video Classification

Authors

  • Xiang Long, Tsinghua University
  • Chuang Gan, Tsinghua University
  • Gerard de Melo, Rutgers University
  • Xiao Liu, Baidu
  • Yandong Li, Baidu
  • Fu Li, Baidu
  • Shilei Wen, Baidu

DOI:

https://doi.org/10.1609/aaai.v32i1.12319

Keywords:

Video Classification, Attention Mechanism

Abstract

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of video data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We validate this conclusion with experiments on four highly heterogeneous datasets (UCF101, ActivityNet, Kinetics, and YouTube-8M) and show that our approach achieves highly competitive results. On large-scale data in particular, our method offers substantial advantages in both efficiency and performance. Most remarkably, our best single model achieves 77.0% top-1 and 93.2% top-5 accuracy on the Kinetics validation set, and 82.2% GAP@20 on the official YouTube-8M test set.
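
For readers unfamiliar with the idea, a minimal sketch of keyless attention pooling and per-modality fusion follows. It assumes PyTorch; the class names, feature dimensions, and the concatenation-based fusion head are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeylessAttention(nn.Module):
    """Attention pooling without an external query ("key"): each
    timestep is scored from its own feature vector alone."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # A single learned scoring vector; no query/key interaction.
        self.score = nn.Linear(feature_dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim)
        weights = F.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)            # (batch, feature_dim)


class MultimodalKeylessFusion(nn.Module):
    """Pool each modality with its own keyless attention, then
    concatenate the pooled vectors for classification."""

    def __init__(self, modality_dims, num_classes: int):
        super().__init__()
        self.attend = nn.ModuleList(KeylessAttention(d) for d in modality_dims)
        self.classifier = nn.Linear(sum(modality_dims), num_classes)

    def forward(self, modalities):
        # modalities: one (batch, time_m, dim_m) tensor per modality;
        # sequence lengths may differ across modalities.
        pooled = [attend(x) for attend, x in zip(self.attend, modalities)]
        return self.classifier(torch.cat(pooled, dim=-1))


# Example with hypothetical RGB and audio sequence features.
rgb = torch.randn(2, 30, 1024)   # (batch, frames, feature_dim)
audio = torch.randn(2, 50, 128)  # audio may have a different length
model = MultimodalKeylessFusion([1024, 128], num_classes=400)
logits = model([rgb, audio])     # -> shape (2, 400)
```

Because the attention scores are computed from the features themselves rather than from a query/key pair, the pooling adds only one learned scoring vector per modality, which is consistent with the abstract's claim of efficiency on large-scale data.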

Published

2018-04-27

How to Cite

Long, X., Gan, C., de Melo, G., Liu, X., Li, Y., Li, F., & Wen, S. (2018). Multimodal Keyless Attention Fusion for Video Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.12319