Shrinking Temporal Attention in Transformers for Video Action Recognition

Authors

  • Bonan Li, University of Chinese Academy of Sciences
  • Pengfei Xiong, Tencent
  • Congying Han, University of Chinese Academy of Sciences
  • Tiande Guo, University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v36i2.20013

Keywords:

Computer Vision (CV), Machine Learning (ML)

Abstract

Spatiotemporal modeling in a unified architecture is key to video action recognition. This paper proposes the Shrinking Temporal Attention Transformer (STAT), which efficiently builds spatiotemporal attention maps while accounting for the attenuation of spatial attention across short and long temporal sequences. Specifically, the query token interacts with short-term temporal tokens in a fine-grained manner to capture short-range motion, then shrinks to coarse attention over spatial neighborhoods of long-term tokens, providing a larger receptive field for long-range spatial aggregation. Both are composed into a short-long temporal integrated block that models visual appearance and temporal structure concurrently at lower computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple action recognition benchmarks, including Kinetics-400 and Something-Something v2, outperforming prior methods with 50% fewer FLOPs and without any pretrained model.
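
As a rough illustration of the short-long mechanism the abstract describes, the PyTorch sketch below lets each query frame attend to full-resolution tokens of temporally close frames (fine-grained, short-term) and to spatially average-pooled tokens of distant frames (coarse, long-term). All module names, the window size, and the pooling factor are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShrinkingTemporalAttention(nn.Module):
    """Minimal sketch of the shrinking-temporal-attention idea: fine-grained
    attention to nearby frames, coarse (spatially pooled) attention to
    distant frames. Hyperparameter names are assumptions for illustration."""

    def __init__(self, dim, num_heads=8, short_window=1, pool_size=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.short_window = short_window  # frames per side kept at full resolution
        self.pool_size = pool_size        # spatial shrink factor for long-term tokens
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _pool_spatial(self, x, B, n_frames, H, W):
        # x: (B, n_frames, H*W, heads, head_dim) -> spatially pooled key/value tokens
        x = x.permute(0, 1, 3, 4, 2).reshape(
            B * n_frames * self.num_heads, self.head_dim, H, W)
        x = F.avg_pool2d(x, self.pool_size)
        Hp, Wp = x.shape[-2:]
        x = x.reshape(B, n_frames, self.num_heads, self.head_dim, Hp * Wp)
        return x.permute(0, 1, 4, 2, 3).reshape(
            B, n_frames * Hp * Wp, self.num_heads, self.head_dim)

    def forward(self, x):
        # x: (B, T, H, W, C) grid of video tokens
        B, T, H, W, C = x.shape
        qkv = self.qkv(x).reshape(B, T, H * W, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=3)  # each: (B, T, H*W, heads, head_dim)

        outs = []
        for t in range(T):
            lo = max(0, t - self.short_window)
            hi = min(T, t + self.short_window + 1)
            far = [s for s in range(T) if s < lo or s >= hi]
            # Short-term: keep every spatial token of nearby frames.
            k_all = [k[:, lo:hi].reshape(B, -1, self.num_heads, self.head_dim)]
            v_all = [v[:, lo:hi].reshape(B, -1, self.num_heads, self.head_dim)]
            # Long-term: shrink distant frames spatially for a wide, cheap context.
            if far:
                k_all.append(self._pool_spatial(k[:, far], B, len(far), H, W))
                v_all.append(self._pool_spatial(v[:, far], B, len(far), H, W))
            k_t = torch.cat(k_all, dim=1)
            v_t = torch.cat(v_all, dim=1)

            attn = torch.einsum('bqhd,bkhd->bhqk', q[:, t], k_t) * self.scale
            attn = attn.softmax(dim=-1)
            outs.append(torch.einsum('bhqk,bkhd->bqhd', attn, v_t))

        out = torch.stack(outs, dim=1).reshape(B, T, H, W, C)
        return self.proj(out)


# Usage example on a toy token grid (hypothetical sizes):
# stat = ShrinkingTemporalAttention(dim=64)
# y = stat(torch.randn(2, 8, 14, 14, 64))  # -> (2, 8, 14, 14, 64)
```

Pooling only the long-term keys and values is what reduces FLOPs relative to full joint spatiotemporal attention: distant frames contribute roughly 1/pool_size² as many key/value tokens while the query still covers the whole clip.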

Published

2022-06-28

How to Cite

Li, B., Xiong, P., Han, C., & Guo, T. (2022). Shrinking Temporal Attention in Transformers for Video Action Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), 1263-1271. https://doi.org/10.1609/aaai.v36i2.20013

Section

AAAI Technical Track on Computer Vision II