DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models

Authors

  • Wei Chen, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Jianwei Niu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China; Zhengzhou University Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou, China
  • Xuefeng Liu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China
  • Zhendong Wang, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Shaojie Tang, Department of Management Science and Systems, University at Buffalo, Buffalo, New York, United States
  • Guogang Zhu, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v39i2.32221

Abstract

Dense video captioning (DVC) aims to describe multiple events within a video, and its performance is greatly affected by the accuracy of video event detection. Video event detection involves predicting the proposal boundaries (start and end times) and the classification score of each event in a video. Recently, a few methods have applied diffusion models originally designed for image object detection to detect events in DVC. These methods add noise to the ground-truth event proposal boundaries, and subsequently learn the denoising process. However, these methods often overlook the fundamental differences between videos and images. We observe that, whereas in images the important information for object classification is normally around the boundaries of the ground-truth boxes, in videos the key information for event classification is typically centered in the middle of ground-truth event proposals. As a result, the classification module in these existing diffusion models becomes insensitive to boundary changes introduced by the added noise, leading to sub-optimal performance. This paper introduces DiffDVC, an innovative diffusion model for DVC. The core of DiffDVC is a boundary-sensitive detector. The detector increases the sensitivity of the classification module to boundary changes by focusing on frames within a specific range around the start and end times of noisy event proposals. Additionally, this range is dynamically adjusted to suit different event proposals. Comprehensive experiments on the ActivityNet-1.3, ActivityNet Captions, and YouCook2 datasets show that DiffDVC achieves superior performance.
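The two mechanisms the abstract describes — perturbing ground-truth proposal boundaries with noise (the diffusion forward process) and focusing classification on frames near the noisy start/end times — can be sketched roughly as below. All function names, the Gaussian noise model, and the fixed `delta` range are illustrative assumptions, not the paper's actual implementation (in particular, DiffDVC adjusts the range dynamically per proposal, which is not modeled here).

```python
import numpy as np

def add_proposal_noise(boundaries, noise_scale, rng):
    """Perturb ground-truth (start, end) event boundaries with Gaussian noise,
    mimicking a single step of a diffusion forward process (illustrative;
    the paper's actual noise schedule is not reproduced here)."""
    noisy = boundaries + rng.normal(0.0, noise_scale, boundaries.shape)
    # Keep boundaries within the normalized video duration [0, 1].
    return np.clip(noisy, 0.0, 1.0)

def boundary_mask(num_frames, start, end, delta):
    """Build a per-frame weight that is 1 within +/- delta (as a fraction of
    video length) of the noisy start/end times and 0 elsewhere, so the
    classifier attends to boundary regions rather than the event center."""
    t = np.linspace(0.0, 1.0, num_frames)  # normalized frame timestamps
    near_start = np.abs(t - start) <= delta
    near_end = np.abs(t - end) <= delta
    return (near_start | near_end).astype(np.float32)

# Example: a noisy proposal spanning roughly [0.2, 0.8] of a 100-frame video.
rng = np.random.default_rng(0)
noisy = add_proposal_noise(np.array([[0.2, 0.8]]), noise_scale=0.05, rng=rng)
mask = boundary_mask(num_frames=100, start=0.2, end=0.8, delta=0.05)
```

A mask like this makes the classification loss responsive to boundary shifts: moving `start` or `end` changes which frames are weighted, whereas a classifier pooled over the whole proposal (dominated by center frames) barely notices.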

Published

2025-04-11

How to Cite

Chen, W., Niu, J., Liu, X., Wang, Z., Tang, S., & Zhu, G. (2025). DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 2221–2229. https://doi.org/10.1609/aaai.v39i2.32221

Section

AAAI Technical Track on Computer Vision I