Localizing Natural Language in Videos

Authors

  • Jingyuan Chen, National University of Singapore
  • Lin Ma, Tencent AI Lab
  • Xinpeng Chen, Tencent AI Lab
  • Zequn Jie, Tencent AI Lab
  • Jiebo Luo, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v33i01.33018175

Abstract

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize the segment in the video that semantically corresponds to the description. We propose a localizing network (LNet), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural-language sentence and the video sequence with cross-gated attended recurrent networks, which exploit their fine-grained interactions and generate a sentence-aware video representation. A self interactor is then proposed to perform cross-frame matching, dynamically encoding and aggregating the matching evidence. Finally, a boundary model locates the video segment corresponding to the sentence by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently compared with state-of-the-art approaches.
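The abstract outlines a three-stage pipeline, which the PyTorch sketch below illustrates at a high level. It is not the authors' implementation: all module names (CrossGatedMatcher, SelfInteractor, BoundaryModel, LNetSketch), feature dimensions, and the specific attention and gating forms are assumptions made for illustration; consult the paper for the exact formulation.

import torch
import torch.nn as nn


class CrossGatedMatcher(nn.Module):
    """Attend each video frame to the sentence words, then gate the fusion
    of the original and attended features (a stand-in for cross-gating)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video, sentence):
        # video: (B, T, D), sentence: (B, L, D)
        attn = torch.softmax(video @ sentence.transpose(1, 2), dim=-1)  # (B, T, L)
        attended = attn @ sentence                                      # (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([video, attended], dim=-1)))
        # Sentence-aware video representation.
        return g * video + (1 - g) * attended                           # (B, T, D)


class SelfInteractor(nn.Module):
    """Cross-frame matching: each frame aggregates evidence from all other
    frames via a frame-frame affinity, then a BiGRU encodes the sequence."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        scores = torch.softmax(x @ x.transpose(1, 2), dim=-1)  # (B, T, T)
        x = x + scores @ x                                     # aggregate evidence
        h, _ = self.rnn(x)
        return self.proj(h)                                    # (B, T, D)


class BoundaryModel(nn.Module):
    """Predict per-frame logits for being the start / end of the segment."""
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, x):
        return self.start(x).squeeze(-1), self.end(x).squeeze(-1)  # (B, T) each


class LNetSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.matcher = CrossGatedMatcher(dim)
        self.interactor = SelfInteractor(dim)
        self.boundary = BoundaryModel(dim)

    def forward(self, video, sentence):
        x = self.matcher(video, sentence)   # sentence-aware video representation
        x = self.interactor(x)              # cross-frame evidence aggregation
        return self.boundary(x)             # start / end logits over frames


if __name__ == "__main__":
    model = LNetSketch(dim=256)
    video = torch.randn(2, 128, 256)     # 2 videos, 128 frames, hypothetical features
    sentence = torch.randn(2, 12, 256)   # 12 word embeddings per sentence
    s_logits, e_logits = model(video, sentence)
    print(s_logits.shape, e_logits.shape)  # torch.Size([2, 128]) twice

Predicting start and end positions independently, as sketched here, lets the localized segment be read off as the argmax pair over frames; the paper's actual boundary model and training losses may differ.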

Published

2019-07-17

How to Cite

Chen, J., Ma, L., Chen, X., Jie, Z., & Luo, J. (2019). Localizing Natural Language in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 8175-8182. https://doi.org/10.1609/aaai.v33i01.33018175

Section

AAAI Technical Track: Vision