Finding Action Tubes with a Sparse-to-Dense Framework

Authors

  • Yuxi Li, Shanghai Jiao Tong University
  • Weiyao Lin, Shanghai Jiao Tong University
  • Tao Wang, Shanghai Jiao Tong University
  • John See, Multimedia University
  • Rui Qian, Shanghai Jiao Tong University
  • Ning Xu, Adobe Research
  • Limin Wang, Nanjing University
  • Shugong Xu, Shanghai University

DOI:

https://doi.org/10.1609/aaai.v34i07.6811

Abstract

The task of spatio-temporal action detection has attracted a growing number of researchers. Existing dominant methods solve this problem by relying on short-term information and dense, serial detection on each individual frame or clip. Despite their effectiveness, these methods make inadequate use of long-term information and are prone to inefficiency. In this paper, we propose, for the first time, an efficient framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner. The framework has two key characteristics: (1) both long-term and short-term sampled information are explicitly utilized in our spatio-temporal network, and (2) a new dynamic feature sampling module (DTS) is designed to effectively approximate the tube output while keeping the system tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21, and UCFSports benchmark datasets, achieving results competitive with state-of-the-art methods. The proposed sparse-to-dense strategy makes our framework about 7.6 times more efficient than the nearest competitor.
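To make the sparse-to-dense idea concrete, the following is a minimal illustrative sketch: boxes are detected only on sparsely sampled key frames and then densified by linear interpolation into a per-frame action tube. All names, the box format, and the interpolation scheme here are assumptions for illustration; they are not the paper's actual DTS module or network.

```python
# Hypothetical sketch of sparse-to-dense tube generation (not the paper's DTS):
# given boxes on a few key frames, produce one box per frame by interpolation.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def interpolate_box(a: Box, b: Box, t: float) -> Box:
    """Linearly interpolate each coordinate between two boxes; t in [0, 1]."""
    return tuple(a[i] + t * (b[i] - a[i]) for i in range(4))

def sparse_to_dense_tube(key_frames: List[int], key_boxes: List[Box],
                         num_frames: int) -> List[Box]:
    """Expand detections on sparse key frames into a dense per-frame tube."""
    tube: List[Box] = []
    for f in range(num_frames):
        if f <= key_frames[0]:
            tube.append(key_boxes[0])   # extend the first box backward
        elif f >= key_frames[-1]:
            tube.append(key_boxes[-1])  # extend the last box forward
        else:
            # find the surrounding pair of key frames and interpolate between them
            for (f0, b0), (f1, b1) in zip(zip(key_frames, key_boxes),
                                          zip(key_frames[1:], key_boxes[1:])):
                if f0 <= f <= f1:
                    t = (f - f0) / (f1 - f0)
                    tube.append(interpolate_box(b0, b1, t))
                    break
    return tube

# Example: boxes detected on frames 0, 8, and 16 densified to all 17 frames.
tube = sparse_to_dense_tube([0, 8, 16],
                            [(10, 10, 50, 80), (14, 12, 54, 82), (20, 15, 60, 85)],
                            num_frames=17)
print(len(tube), tube[4])
```

The efficiency gain reported in the abstract comes from avoiding dense per-frame detection; in this toy version, the detector runs on three frames instead of seventeen, and the cheap interpolation step fills in the rest.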

Published

2020-04-03

How to Cite

Li, Y., Lin, W., Wang, T., See, J., Qian, R., Xu, N., Wang, L., & Xu, S. (2020). Finding Action Tubes with a Sparse-to-Dense Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 11466-11473. https://doi.org/10.1609/aaai.v34i07.6811

Section

AAAI Technical Track: Vision