CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

Jinxing Zhou; Ziheng Zhou; Yanghao Zhou; Yuxin Mao; Zhangling Duan; Dan Guo

doi:10.1609/aaai.v40i16.38374

Authors

Jinxing Zhou, Mohamed bin Zayed University of Artificial Intelligence
Ziheng Zhou, Hefei University of Technology
Yanghao Zhou, National University of Singapore
Yuxin Mao, OpenNLP Lab
Zhangling Duan, Hefei Comprehensive National Science Center
Dan Guo, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract

The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
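The agreement-score and anchor-identification steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the discrepancy measure or the selection rules, so the code assumes (hypothetically) that agreement is one minus the mean absolute difference between per-class audio and visual probabilities, that global identification keeps the top-k timestamps of the whole video, and that local identification keeps the best timestamp in each non-overlapping window. All function names are illustrative.

```python
def agreement_scores(audio_probs, visual_probs):
    """Per-timestamp agreement in [0, 1].

    Assumption: agreement = 1 - mean absolute difference between the
    audio and visual class probabilities at the same timestamp.
    """
    scores = []
    for a, v in zip(audio_probs, visual_probs):
        discrepancy = sum(abs(x - y) for x, y in zip(a, v)) / len(a)
        scores.append(1.0 - discrepancy)
    return scores


def global_anchors(scores, k):
    """Assumed global rule: the k highest-scoring timestamps in the video."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def local_anchors(scores, window):
    """Assumed local rule: the best timestamp in each non-overlapping window."""
    anchors = []
    for start in range(0, len(scores), window):
        end = min(start + window, len(scores))
        anchors.append(max(range(start, end), key=lambda t: scores[t]))
    return anchors


# Toy predictions: 4 timestamps, 2 event classes (made-up numbers).
audio = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]
visual = [[0.8, 0.2], [0.9, 0.1], [0.5, 0.5], [0.3, 0.7]]

scores = agreement_scores(audio, visual)
print(global_anchors(scores, k=2))
print(local_anchors(scores, window=2))
```

In this toy input the second timestamp has strongly conflicting audio and visual predictions, so it receives the lowest agreement score and is never selected as an anchor under either rule.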
