Multimodal Keyless Attention Fusion for Video Classification

Authors

  • Xiang Long, Tsinghua University
  • Chuang Gan, Tsinghua University
  • Gerard de Melo, Rutgers University
  • Xiao Liu, Baidu
  • Yandong Li, Baidu
  • Fu Li, Baidu
  • Shilei Wen, Baidu

DOI:

https://doi.org/10.1609/aaai.v32i1.12319

Keywords:

Video Classification, Attention Mechanism

Abstract

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of video data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We validate this conclusion with experiments on four highly heterogeneous datasets (UCF101, ActivityNet, Kinetics, and YouTube-8M) and show that our approach achieves highly competitive results. On large-scale data in particular, our method offers substantial advantages in both efficiency and performance. Most remarkably, our best single model achieves 77.0% top-1 and 93.2% top-5 accuracy on the Kinetics validation set, and 82.2% GAP@20 on the official YouTube-8M test set.
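
For readers unfamiliar with the idea, a minimal sketch of keyless attention pooling and per-modality fusion follows. It assumes PyTorch; the class names, feature dimensions, and the concatenation-based fusion head are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeylessAttention(nn.Module):
    """Attention pooling without an external query ("key"): each
    timestep is scored from its own feature vector alone."""

    def __init__(self, feature_dim: int):
        super().__init__()
        # A single learned scoring vector; no query/key interaction.
        self.score = nn.Linear(feature_dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim)
        weights = F.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)            # (batch, feature_dim)


class MultimodalKeylessFusion(nn.Module):
    """Pool each modality with its own keyless attention, then
    concatenate the pooled vectors for classification."""

    def __init__(self, modality_dims, num_classes: int):
        super().__init__()
        self.attend = nn.ModuleList(KeylessAttention(d) for d in modality_dims)
        self.classifier = nn.Linear(sum(modality_dims), num_classes)

    def forward(self, modalities):
        # modalities: one (batch, time_m, dim_m) tensor per modality;
        # sequence lengths may differ across modalities.
        pooled = [attend(x) for attend, x in zip(self.attend, modalities)]
        return self.classifier(torch.cat(pooled, dim=-1))


# Example with hypothetical RGB and audio sequence features.
rgb = torch.randn(2, 30, 1024)   # (batch, frames, feature_dim)
audio = torch.randn(2, 50, 128)  # audio may have a different length
model = MultimodalKeylessFusion([1024, 128], num_classes=400)
logits = model([rgb, audio])     # -> shape (2, 400)
```

Because the attention scores are computed from the features themselves rather than from a query/key pair, the pooling adds only one learned scoring vector per modality, which is consistent with the abstract's claim of efficiency on large-scale data.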

Published

2018-04-27

How to Cite

Long, X., Gan, C., de Melo, G., Liu, X., Li, Y., Li, F., & Wen, S. (2018). Multimodal Keyless Attention Fusion for Video Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.12319