RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Authors

  • Peihao Chen, School of Software Engineering, South China University of Technology; Pazhou Laboratory
  • Deng Huang, School of Software Engineering, South China University of Technology
  • Dongliang He, Baidu Inc.
  • Xiang Long, Baidu Inc.
  • Runhao Zeng, School of Software Engineering, South China University of Technology
  • Shilei Wen, Baidu Inc.
  • Mingkui Tan, School of Software Engineering, South China University of Technology; Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
  • Chuang Gan, MIT-IBM Watson AI Lab

Keywords

Video Understanding & Activity Analysis

Abstract

We study unsupervised video representation learning, which seeks to learn both motion and appearance features from unlabeled videos only; these features can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to 1) the highly complex spatial-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task to effectively model both motion and appearance features. Recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learned models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with motion patterns and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive playback speed, exploiting the relative speed between two video clips as the label. In this way, we are able to effectively perceive speed and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, in which we enforce the model to perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves performance on two downstream tasks (namely, action recognition and video retrieval) as the number of pre-training epochs increases. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without using any labeled data for pre-training, which outperforms the ImageNet supervised pre-trained model. Our code, pre-trained models, and supplementary materials can be found at https://github.com/PeihaoChen/RSPNet.
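To make the relative-speed idea concrete, the following is a minimal sketch (not the authors' implementation; function names and the 3-way label scheme are illustrative assumptions) of how one might sample two clips from a video at different playback speeds and derive a relative-speed label for them:

```python
import random

def sample_clip(num_frames, clip_len, speed):
    """Sample frame indices for a clip played at `speed`x,
    i.e., take every `speed`-th frame from a random start.
    (Hypothetical helper; not from the RSPNet codebase.)"""
    span = (clip_len - 1) * speed + 1  # frames covered by the clip
    start = random.randrange(num_frames - span + 1)
    return [start + i * speed for i in range(clip_len)]

def relative_speed_label(speed_a, speed_b):
    """3-way relative-speed label: is clip A slower than,
    the same speed as, or faster than clip B?"""
    if speed_a < speed_b:
        return 0  # A slower than B
    if speed_a == speed_b:
        return 1  # same speed
    return 2  # A faster than B

# Example: a 300-frame video, two 16-frame clips at 1x and 2x speed.
clip_a = sample_clip(300, 16, speed=1)
clip_b = sample_clip(300, 16, speed=2)
label = relative_speed_label(1, 2)  # 0: clip A is slower
```

The key point matching the abstract: the supervision signal is the *relative* speed between the two clips, which stays consistent with the underlying motion pattern even when the absolute frame rate of the source video is unknown.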

Published

2021-05-18

How to Cite

Chen, P., Huang, D., He, D., Long, X., Zeng, R., Wen, S., Tan, M., & Gan, C. (2021). RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(2), 1045-1053. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16189

Section

AAAI Technical Track on Computer Vision I