Generating Natural-Language Video Descriptions Using Text-Mined Knowledge

Authors

  • Niveda Krishnamoorthy, University of Texas at Austin
  • Girish Malkarnenkar, University of Texas at Austin
  • Raymond Mooney, University of Texas at Austin
  • Kate Saenko, University of Massachusetts Lowell
  • Sergio Guadarrama, University of California, Berkeley

DOI:

https://doi.org/10.1609/aaai.v27i1.8679

Keywords:

video description, text mining, grounding

Abstract

We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it with contextual information, and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
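To make the triplet-selection idea in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the authors' code: the detector scores, text-mined likelihoods, and the simple linear interpolation below are all hypothetical placeholders standing in for the paper's actual detectors, corpus statistics, and combination model.

```python
# Illustrative sketch: pick the most probable subject-verb-object (SVO)
# triplet by combining visual detector confidences with text-mined
# co-occurrence statistics. All names and numbers are made up.

import itertools

# Hypothetical detector outputs: candidate labels with confidence scores.
subject_scores = {"person": 0.90, "dog": 0.40}
verb_scores = {"ride": 0.50, "walk": 0.45}
object_scores = {"bicycle": 0.80, "dog": 0.30}

# Hypothetical SVO likelihoods mined from a web-scale text corpus,
# e.g., normalized counts of parsed "person rides bicycle" sentences.
text_mined_prob = {
    ("person", "ride", "bicycle"): 0.70,
    ("person", "walk", "dog"): 0.60,
    ("dog", "ride", "bicycle"): 0.01,
}

def triplet_score(s, v, o, alpha=0.5):
    """Linearly interpolate visual confidence and corpus likelihood.
    alpha is an assumed weighting parameter, not taken from the paper."""
    visual = subject_scores[s] * verb_scores[v] * object_scores[o]
    textual = text_mined_prob.get((s, v, o), 1e-6)  # smooth unseen triplets
    return alpha * visual + (1 - alpha) * textual

best = max(
    itertools.product(subject_scores, verb_scores, object_scores),
    key=lambda svo: triplet_score(*svo),
)
print("Best SVO triplet:", best)  # -> ('person', 'ride', 'bicycle')
```

In this toy example the corpus statistics pull the choice toward "person rides bicycle" even when the verb detector alone barely distinguishes "ride" from "walk", which is the kind of contextual disambiguation the abstract attributes to the text-mined knowledge.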

Published

2013-06-30

How to Cite

Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Saenko, K., & Guadarrama, S. (2013). Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 27(1), 541-547. https://doi.org/10.1609/aaai.v27i1.8679