Using Closed Captions as Supervision for Video Activity Recognition

Authors

  • Sonal Gupta, Stanford University
  • Raymond Mooney, University of Texas at Austin

DOI:

https://doi.org/10.1609/aaai.v24i1.7738

Keywords:

Closed Captions, Weak Supervision, Video Retrieval, Human Action Recognition, Captioned Video, Multimodal Learning

Abstract

Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but labeling them requires expensive human labor and time. This paper explores how closed captions that naturally accompany many videos can act as weak supervision, allowing "labeled" data for activity recognition to be collected automatically. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.

Published

2010-07-04

How to Cite

Gupta, S., & Mooney, R. (2010). Using Closed Captions as Supervision for Video Activity Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1), 1083-1088. https://doi.org/10.1609/aaai.v24i1.7738

Section

Reasoning about Plans, Processes and Actions