Comprehensive Visual Grounding for Video Description

Wenhui Jiang; Yibo Cheng; Linxin Liu; Yuming Fang; Yuxin Peng; Yang Liu

doi:10.1609/aaai.v38i3.28032

Authors

Wenhui Jiang, Jiangxi University of Finance and Economics
Yibo Cheng, Jiangxi University of Finance and Economics
Linxin Liu, Jiangxi University of Finance and Economics
Yuming Fang, Jiangxi University of Finance and Economics
Yuxin Peng, Peking University
Yang Liu, Sany Heavy Industry Co., LTD

DOI:

https://doi.org/10.1609/aaai.v38i3.28032

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis

Abstract

The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations, where captioning accuracy often suffers from degraded object appearance in the annotated area, such as motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning by explicitly linking entities and actions to visual clues across video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though the entity is annotated in only one frame of a video. The action grounding dynamically associates verbs with their related subjects and the corresponding context, which preserves fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by soft grounding supervision, which simplifies the architecture and improves training efficiency. We conduct extensive experiments on two challenging datasets and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to state-of-the-art methods.
