DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models

Authors

  • Wei Chen, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Jianwei Niu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China; Zhengzhou University Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou, China
  • Xuefeng Liu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China
  • Zhendong Wang, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Shaojie Tang, Department of Management Science and Systems, University at Buffalo, Buffalo, New York, United States
  • Guogang Zhu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v39i2.32221

Abstract

Dense video captioning (DVC) aims to describe multiple events within a video, and its performance is greatly affected by the accuracy of video event detection. Video event detection involves predicting the proposal boundaries (start and end times) and the classification score of each event in a video. Recently, a few methods have applied diffusion models originally designed for image object detection to detect events in DVC. These methods add noise to the ground-truth event proposal boundaries, and subsequently learn the denoising process. However, these methods often overlook the fundamental differences between videos and images. We observe that, whereas in images the important information for object classification is normally around the boundaries of the ground-truth boxes, in videos the key information for event classification is typically centered in the middle of ground-truth event proposals. As a result, the classification module in these existing diffusion models becomes insensitive to boundary changes introduced by the added noise, leading to sub-optimal performance. This paper introduces DiffDVC, an innovative diffusion model for DVC. The core of DiffDVC is a boundary-sensitive detector. The detector increases the sensitivity of the classification module to boundary changes by focusing on frames within a specific range around the start and end times of noisy event proposals. Additionally, this range is dynamically adjusted to suit different event proposals. Comprehensive experiments on the ActivityNet-1.3, ActivityNet Captions, and YouCook2 datasets show that DiffDVC achieves superior performance.
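The two mechanisms the abstract describes — perturbing ground-truth proposal boundaries with noise (the diffusion forward process) and focusing classification on frames near the noisy start/end times — can be sketched roughly as below. All function names, the Gaussian noise model, and the fixed `delta` range are illustrative assumptions, not the paper's actual implementation (in particular, DiffDVC adjusts the range dynamically per proposal, which is not modeled here).

```python
import numpy as np

def add_proposal_noise(boundaries, noise_scale, rng):
    """Perturb ground-truth (start, end) event boundaries with Gaussian noise,
    mimicking a single step of a diffusion forward process (illustrative;
    the paper's actual noise schedule is not reproduced here)."""
    noisy = boundaries + rng.normal(0.0, noise_scale, boundaries.shape)
    # Keep boundaries within the normalized video duration [0, 1].
    return np.clip(noisy, 0.0, 1.0)

def boundary_mask(num_frames, start, end, delta):
    """Build a per-frame weight that is 1 within +/- delta (as a fraction of
    video length) of the noisy start/end times and 0 elsewhere, so the
    classifier attends to boundary regions rather than the event center."""
    t = np.linspace(0.0, 1.0, num_frames)  # normalized frame timestamps
    near_start = np.abs(t - start) <= delta
    near_end = np.abs(t - end) <= delta
    return (near_start | near_end).astype(np.float32)

# Example: a noisy proposal spanning roughly [0.2, 0.8] of a 100-frame video.
rng = np.random.default_rng(0)
noisy = add_proposal_noise(np.array([[0.2, 0.8]]), noise_scale=0.05, rng=rng)
mask = boundary_mask(num_frames=100, start=0.2, end=0.8, delta=0.05)
```

A mask like this makes the classification loss responsive to boundary shifts: moving `start` or `end` changes which frames are weighted, whereas a classifier pooled over the whole proposal (dominated by center frames) barely notices.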

Published

2025-04-11

How to Cite

Chen, W., Niu, J., Liu, X., Wang, Z., Tang, S., & Zhu, G. (2025). DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 2221–2229. https://doi.org/10.1609/aaai.v39i2.32221

Section

AAAI Technical Track on Computer Vision I