Towards Global Video Scene Segmentation with Context-Aware Transformer

Authors

  • Yang Yang, Nanjing University of Science and Technology; MIIT Key Lab. of Pattern Analysis and Machine Intelligence, NUAA; State Key Lab. for Novel Software Technology, NJU
  • Yurui Huang, Nanjing University of Science and Technology
  • Weili Guo, Nanjing University of Science and Technology
  • Baohua Xu, Huawei Technologies Company
  • Dingyin Xia, Huawei Technologies Company

DOI:

https://doi.org/10.1609/aaai.v37i3.25426

Keywords:

CV: Video Understanding & Activity Analysis, ML: Classification and Regression, ML: Unsupervised & Self-Supervised Learning

Abstract

Videos such as movies and TV episodes usually require dividing the long storyline into cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key challenge lies in finding scene boundaries by comprehensively considering the complex temporal structure and semantic information. To this end, we introduce a novel Context-Aware Transformer (CAT) with a self-supervised learning framework that learns high-quality shot representations for generating well-bounded scenes. More specifically, we design CAT with local-global self-attention, which effectively considers both long-term and short-term context to improve shot encoding. To train CAT, we adopt a self-supervised learning scheme. First, we leverage shot-to-scene-level pretext tasks to facilitate pre-training with pseudo boundaries, which guide CAT to learn discriminative shot representations that maximize intra-scene similarity and inter-scene discrimination in an unsupervised manner. Then, we transfer the contextual representations and fine-tune CAT with supervised data, which encourages CAT to accurately detect scene boundaries for segmentation. As a result, CAT learns context-aware shot representations and provides global guidance for scene segmentation. Our empirical analyses show that CAT achieves state-of-the-art performance on the scene segmentation task on the MovieNet dataset, e.g., an improvement of 2.15 points in AP.
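The abstract does not give implementation details, but the local-global self-attention idea can be illustrated with a minimal PyTorch sketch: a local branch where each shot attends only to its temporal neighbors, and a global branch with unrestricted attention over the whole shot sequence. The window size, the use of standard nn.MultiheadAttention, and the residual combination below are illustrative assumptions, not the authors' exact architecture.

    # Minimal sketch of local-global self-attention over shot features.
    # Assumptions (not from the paper): window size, nn.MultiheadAttention,
    # and summing the two branches with a residual connection.
    import torch
    import torch.nn as nn

    class LocalGlobalAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8, window: int = 8):
            super().__init__()
            self.window = window
            self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, shots: torch.Tensor) -> torch.Tensor:
            # shots: (batch, num_shots, dim) sequence of shot embeddings
            B, T, _ = shots.shape
            # Local branch: block attention beyond +/- window shots (True = masked)
            idx = torch.arange(T, device=shots.device)
            local_mask = (idx[None, :] - idx[:, None]).abs() > self.window
            local_out, _ = self.local_attn(shots, shots, shots, attn_mask=local_mask)
            # Global branch: unrestricted attention over the whole sequence
            global_out, _ = self.global_attn(shots, shots, shots)
            # Combine short-term and long-term context
            return self.norm(shots + local_out + global_out)

    x = torch.randn(2, 64, 256)        # 2 videos, 64 shots each, 256-d features
    y = LocalGlobalAttention(256)(x)   # context-aware shot representations
    print(y.shape)                     # torch.Size([2, 64, 256])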

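The shot-to-scene pretext objective can likewise be sketched: given pseudo scene boundaries, shots within the same pseudo scene are treated as positives and pulled together, while the softmax denominator pushes shots from different pseudo scenes apart, matching the stated goal of maximizing intra-scene similarity and inter-scene discrimination. The InfoNCE-style formulation, the temperature, and the pseudo-label interface below are assumptions for illustration; the paper's exact pretext losses may differ.

    # Hypothetical InfoNCE-style pretext loss over pseudo scene labels.
    import torch
    import torch.nn.functional as F

    def pseudo_scene_contrastive_loss(shots, scene_ids, temperature=0.1):
        # shots: (T, D) shot embeddings; scene_ids: (T,) pseudo scene label per shot
        z = F.normalize(shots, dim=-1)                    # compare in cosine space
        sim = (z @ z.t()) / temperature                   # (T, T) pairwise similarities
        n = shots.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=shots.device)
        pos = (scene_ids[:, None] == scene_ids[None, :]) & ~eye  # same pseudo scene
        logits = sim.masked_fill(eye, float('-inf'))      # exclude self-similarity
        log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
        # Average log-likelihood of each shot's positives (other shots in its scene)
        per_shot = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
        return -per_shot[pos.any(dim=1)].mean()

    shots = torch.randn(16, 256)                          # 16 shots in one video
    scene_ids = torch.tensor([0]*5 + [1]*6 + [2]*5)       # pseudo boundaries after shots 4 and 10
    print(pseudo_scene_contrastive_loss(shots, scene_ids))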
Published

2023-06-26

How to Cite

Yang, Y., Huang, Y., Guo, W., Xu, B., & Xia, D. (2023). Towards Global Video Scene Segmentation with Context-Aware Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 3206-3213. https://doi.org/10.1609/aaai.v37i3.25426

Section

AAAI Technical Track on Computer Vision III