No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Authors

  • Ishan Rajendrakumar Dave University of Central Florida
  • Simon Jenni Adobe Research
  • Mubarak Shah University of Central Florida

DOI

https://doi.org/10.1609/aaai.v38i2.27913

Keywords

CV: Video Understanding & Activity Analysis, CV: Image and Video Retrieval, CV: Representation Learning for Vision, ML: Unsupervised & Self-Supervised Learning

Abstract

Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage.

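To make the abstract's core reformulation concrete, here is a minimal sketch of what a frame-level (rather than clip-level) temporal task could look like: the frames of a clip are shuffled and the model must predict, for every frame, its original temporal position, yielding one target per frame instead of a single clip-level label. An independently sampled per-frame horizontal flip stands in for a shortcut-mitigating augmentation. All names and design details below are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def make_frame_ordering_batch(clip, rng):
    """Build inputs/targets for a toy frame-level ordering task.

    clip: (T, H, W, C) array of video frames.
    Returns (shuffled_clip, targets), where targets[i] is the original
    temporal index of shuffled_clip[i] -- a per-frame classification
    target. NOTE: this is a hypothetical sketch, not the paper's method.
    """
    num_frames = clip.shape[0]
    perm = rng.permutation(num_frames)      # order in which frames are shown
    shuffled = clip[perm].copy()
    for t in range(num_frames):             # independent per-frame flip, so
        if rng.random() < 0.5:              # appearance statistics alone
            shuffled[t] = shuffled[t, :, ::-1]  # cannot reveal the order
    targets = perm                          # one position label per frame
    return shuffled, targets


# Usage with a tiny synthetic clip of 4 frames.
rng = np.random.default_rng(0)
clip = np.arange(4 * 2 * 2 * 3).reshape(4, 2, 2, 3).astype(np.float32)
shuffled, targets = make_frame_ordering_batch(clip, rng)
print(targets.shape)  # one temporal target per frame
```

A clip-level variant would collapse `targets` to a single label (e.g., the permutation's identity), which is the easier formulation the abstract argues leads to saturated training performance.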
Published

2024-03-24

How to Cite

Dave, I. R., Jenni, S., & Shah, M. (2024). No More Shortcuts: Realizing the Potential of Temporal Self-Supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1481–1491. https://doi.org/10.1609/aaai.v38i2.27913

Section

AAAI Technical Track on Computer Vision I