Comprehensive Visual Grounding for Video Description

Authors

  • Wenhui Jiang, Jiangxi University of Finance and Economics
  • Yibo Cheng, Jiangxi University of Finance and Economics
  • Linxin Liu, Jiangxi University of Finance and Economics
  • Yuming Fang, Jiangxi University of Finance and Economics
  • Yuxin Peng, Peking University
  • Yang Liu, Sany Heavy Industry Co., LTD

DOI:

https://doi.org/10.1609/aaai.v38i3.28032

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis

Abstract

The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations, but captioning accuracy often suffers when object appearance in the annotated area is degraded, for example by motion blur or defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network that improves video captioning by explicitly linking entities and actions to visual clues across video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though each entity is annotated in only one frame of a video. The action grounding dynamically associates verbs with their related subjects and the corresponding context, which preserves fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by soft grounding supervision, which simplifies the architecture and improves training efficiency. We conduct extensive experiments on two challenging datasets and demonstrate significant improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT over state-of-the-art methods.
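
The abstract does not spell out how the soft grounding supervision is computed. As a rough illustration only, one common way to supervise attention with a soft target is a cross-entropy between the attention distribution over candidate regions (from all frames, flattened) and a normalized relevance weighting. The function names, the toy weights, and the choice of cross-entropy below are assumptions made for this sketch, not the formulation used in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_grounding_loss(attn_logits, soft_target, eps=1e-12):
    """Cross-entropy between a soft target and the attention distribution.

    attn_logits: unnormalized attention scores, one per region
                 (regions from all frames flattened into one list).
    soft_target: non-negative relevance weights per region (hypothetical,
                 e.g. high for regions resembling the annotated entity).
    """
    attn = softmax(attn_logits)
    total = sum(soft_target)
    if total <= 0:
        return 0.0  # no grounding signal for this word
    target = [t / total for t in soft_target]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))

# Toy example: 2 frames x 3 regions, flattened to 6 regions.
# The entity is annotated in frame 0 (region 1); region 4 in frame 1
# is assumed to receive a softer weight.
target = [0.0, 1.0, 0.0, 0.0, 0.6, 0.0]
focused = [0.0, 4.0, 0.0, 0.0, 3.5, 0.0]  # attention concentrated on the entity
diffuse = [0.0] * 6                        # uniform attention

loss_focused = soft_grounding_loss(focused, target)
loss_diffuse = soft_grounding_loss(diffuse, target)
```

In this toy setup, attention that concentrates on the weighted regions yields a lower loss than uniform attention, which is the behavior a grounding objective of this kind is meant to encourage.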

Published

2024-03-24

How to Cite

Jiang, W., Cheng, Y., Liu, L., Fang, Y., Peng, Y., & Liu, Y. (2024). Comprehensive Visual Grounding for Video Description. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2552–2560. https://doi.org/10.1609/aaai.v38i3.28032

Issue

Vol. 38 No. 3 (2024)

Section

AAAI Technical Track on Computer Vision II