Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning

Authors

  • Zhuyang Xie — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Yan Yang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China
  • Yankai Yu — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Jie Wang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
  • Yongquan Jiang — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China
  • Xiao Wu — School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China

DOI:

https://doi.org/10.1609/aaai.v39i8.32948

Abstract

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level and leverage these concepts to provide temporal event cues; and (2) establish cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, weakly supervised concept detection is performed for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to produce more discriminative concept embeddings. In the captioning network, a cyclic co-learning strategy is proposed, where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
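The abstract mentions video-level concept contrastive learning for producing discriminative concept embeddings. The paper's exact formulation is not given on this page, so the following is a hypothetical sketch of the generic idea — an InfoNCE-style loss that pulls each anchor embedding toward its matched concept embedding and pushes it away from mismatched ones. The function name `info_nce` and the toy embeddings are illustrative, not from the paper.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Toy InfoNCE-style contrastive loss over concept embeddings.

    Hypothetical sketch: anchors[i] is assumed to be paired with
    positives[i]; all other positives act as negatives. This is a
    generic contrastive objective, not the paper's exact loss.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    loss = 0.0
    for i, a in enumerate(anchors):
        # Softmax over similarities; the matched pair is the target class.
        sims = [math.exp(cos(a, p) / temperature) for p in positives]
        loss += -math.log(sims[i] / sum(sims))
    return loss / len(anchors)

# Toy 2-D concept embeddings: index i is the positive for anchor i.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(anchors, positives)

# Shuffling the positives breaks the pairing, so the loss grows.
loss_shuffled = info_nce(anchors, positives[::-1])
assert loss_aligned < loss_shuffled
```

Minimizing such a loss makes matched concept embeddings more similar than mismatched ones, which is the sense in which contrastive learning yields "more discriminative" embeddings.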

Published

2025-04-11

How to Cite

Xie, Z., Yang, Y., Yu, Y., Wang, J., Jiang, Y., & Wu, X. (2025). Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8771–8779. https://doi.org/10.1609/aaai.v39i8.32948

Section

AAAI Technical Track on Computer Vision VII