VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

Authors

  • Peng Wu, Northwestern Polytechnical University
  • Wanshun Su, Northwestern Polytechnical University
  • Xiangteng He, Peking University
  • Peng Wang, Northwestern Polytechnical University
  • Yanning Zhang, Northwestern Polytechnical University

DOI:

https://doi.org/10.1609/aaai.v39i8.32909

Abstract

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed, long videos through cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to search for the crucial visual components in these untrimmed, long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further address the alignment of untrimmed, long videos, an anomaly-biased weighting is devised for the fine-grained alignment: it identifies key segments in untrimmed, long videos using anomaly priors and gives them more attention, thereby discarding irrelevant segment information and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, surpassing the best competitors by 5.0% and 5.3% R@1 on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets, respectively.
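
This page does not include reference code, so the following PyTorch sketch is only a rough illustration of the anomaly-biased weighting idea described above, not the authors' implementation. The function name, the softmax temperature, and the assumption that a per-segment anomaly prior is available are all hypothetical.

    import torch
    import torch.nn.functional as F

    def anomaly_biased_score(segment_feats, query_feat, anomaly_priors, temperature=0.1):
        # segment_feats:  (num_segments, dim) segment embeddings from a CMP video encoder
        # query_feat:     (dim,) embedding of the text or audio query
        # anomaly_priors: (num_segments,) anomaly prior per segment (higher = more anomalous)
        seg = F.normalize(segment_feats, dim=-1)   # unit-norm segment embeddings
        qry = F.normalize(query_feat, dim=-1)      # unit-norm query embedding
        sims = seg @ qry                           # cosine similarity of each segment to the query

        # Anomaly-biased weights: segments flagged by the anomaly prior receive
        # more attention, softly discarding irrelevant segments.
        weights = F.softmax(anomaly_priors / temperature, dim=-1)

        # Weighted fine-grained alignment score for the whole untrimmed video.
        return (weights * sims).sum()

    # Toy usage with random features (illustration only).
    score = anomaly_biased_score(torch.randn(32, 512), torch.randn(512), torch.rand(32))

A temperature below 1 sharpens the weighting toward the most anomalous segments; with uniform priors the score degenerates to plain mean pooling over all segments, which is why irrelevant background in long videos would dilute the match without the bias.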

Published

2025-04-11

How to Cite

Wu, P., Su, W., He, X., Wang, P., & Zhang, Y. (2025). VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8423–8431. https://doi.org/10.1609/aaai.v39i8.32909

Section

AAAI Technical Track on Computer Vision VII