CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
DOI:
https://doi.org/10.1609/aaai.v40i16.38374Abstract
The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.Published
2026-03-14
How to Cite
Zhou, J., Zhou, Z., Zhou, Y., Mao, Y., Duan, Z., & Guo, D. (2026). CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13674–13682. https://doi.org/10.1609/aaai.v40i16.38374
Issue
Section
AAAI Technical Track on Computer Vision XIII