Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning

Authors

  • Zhuyang Xie — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Yan Yang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China
  • Yankai Yu — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Jie Wang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Yongquan Jiang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China
  • Xiao Wu — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China

DOI:

https://doi.org/10.1609/aaai.v39i8.32948

Abstract

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level and leverage these concepts to provide temporal event cues; and (2) establish cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, weakly supervised concept detection is performed for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to produce more discriminative concept embeddings. In the captioning network, a cyclic co-learning strategy is proposed, where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
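The abstract mentions video-level concept contrastive learning for producing discriminative concept embeddings. The paper's exact formulation is not given on this page, so the following is a hypothetical sketch of the generic idea — an InfoNCE-style loss that pulls each anchor embedding toward its matched concept embedding and pushes it away from mismatched ones. The function name `info_nce` and the toy embeddings are illustrative, not from the paper.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Toy InfoNCE-style contrastive loss over concept embeddings.

    Hypothetical sketch: anchors[i] is assumed to be paired with
    positives[i]; all other positives act as negatives. This is a
    generic contrastive objective, not the paper's exact loss.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    loss = 0.0
    for i, a in enumerate(anchors):
        # Softmax over similarities; the matched pair is the target class.
        sims = [math.exp(cos(a, p) / temperature) for p in positives]
        loss += -math.log(sims[i] / sum(sims))
    return loss / len(anchors)

# Toy 2-D concept embeddings: index i is the positive for anchor i.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(anchors, positives)

# Shuffling the positives breaks the pairing, so the loss grows.
loss_shuffled = info_nce(anchors, positives[::-1])
assert loss_aligned < loss_shuffled
```

Minimizing such a loss makes matched concept embeddings more similar than mismatched ones, which is the sense in which contrastive learning yields "more discriminative" embeddings.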

Published

2025-04-11

How to Cite

Xie, Z., Yang, Y., Yu, Y., Wang, J., Jiang, Y., & Wu, X. (2025). Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8771–8779. https://doi.org/10.1609/aaai.v39i8.32948

Section

AAAI Technical Track on Computer Vision VII