Localizing Natural Language in Videos

Authors

  • Jingyuan Chen, National University of Singapore
  • Lin Ma, Tencent AI Lab
  • Xinpeng Chen, Tencent AI Lab
  • Zequn Jie, Tencent AI Lab
  • Jiebo Luo, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v33i01.33018175

Abstract

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize the segment in the video that semantically corresponds to the description. We propose a localizing network (LNet), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural-language sentence and the video sequence with cross-gated attended recurrent networks, which exploit their fine-grained interactions and generate a sentence-aware video representation. A self interactor is then proposed to perform cross-frame matching, dynamically encoding and aggregating the matching evidence. Finally, a boundary model locates the video segment corresponding to the sentence by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently compared with state-of-the-art approaches.
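The abstract outlines a three-stage pipeline, which the PyTorch sketch below illustrates at a high level. It is not the authors' implementation: all module names (CrossGatedMatcher, SelfInteractor, BoundaryModel, LNetSketch), feature dimensions, and the specific attention and gating forms are assumptions made for illustration; consult the paper for the exact formulation.

import torch
import torch.nn as nn


class CrossGatedMatcher(nn.Module):
    """Attend each video frame to the sentence words, then gate the fusion
    of the original and attended features (a stand-in for cross-gating)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video, sentence):
        # video: (B, T, D), sentence: (B, L, D)
        attn = torch.softmax(video @ sentence.transpose(1, 2), dim=-1)  # (B, T, L)
        attended = attn @ sentence                                      # (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([video, attended], dim=-1)))
        # Sentence-aware video representation.
        return g * video + (1 - g) * attended                           # (B, T, D)


class SelfInteractor(nn.Module):
    """Cross-frame matching: each frame aggregates evidence from all other
    frames via a frame-frame affinity, then a BiGRU encodes the sequence."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        scores = torch.softmax(x @ x.transpose(1, 2), dim=-1)  # (B, T, T)
        x = x + scores @ x                                     # aggregate evidence
        h, _ = self.rnn(x)
        return self.proj(h)                                    # (B, T, D)


class BoundaryModel(nn.Module):
    """Predict per-frame logits for being the start / end of the segment."""
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, x):
        return self.start(x).squeeze(-1), self.end(x).squeeze(-1)  # (B, T) each


class LNetSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.matcher = CrossGatedMatcher(dim)
        self.interactor = SelfInteractor(dim)
        self.boundary = BoundaryModel(dim)

    def forward(self, video, sentence):
        x = self.matcher(video, sentence)   # sentence-aware video representation
        x = self.interactor(x)              # cross-frame evidence aggregation
        return self.boundary(x)             # start / end logits over frames


if __name__ == "__main__":
    model = LNetSketch(dim=256)
    video = torch.randn(2, 128, 256)     # 2 videos, 128 frames, hypothetical features
    sentence = torch.randn(2, 12, 256)   # 12 word embeddings per sentence
    s_logits, e_logits = model(video, sentence)
    print(s_logits.shape, e_logits.shape)  # torch.Size([2, 128]) twice

Predicting start and end positions independently, as sketched here, lets the localized segment be read off as the argmax pair over frames; the paper's actual boundary model and training losses may differ.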

Published

2019-07-17

How to Cite

Chen, J., Ma, L., Chen, X., Jie, Z., & Luo, J. (2019). Localizing Natural Language in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 8175-8182. https://doi.org/10.1609/aaai.v33i01.33018175

Section

AAAI Technical Track: Vision