TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos

Lingru Zhou; Peng Wu; Manqing Zhang; Qingsheng Wang; Guansong Pang; Peng Wang

doi:10.1609/aaai.v40i16.38378

Authors

Lingru Zhou, Northwestern Polytechnical University
Peng Wu, Northwestern Polytechnical University
Manqing Zhang, Northwestern Polytechnical University
Qingsheng Wang, Northwestern Polytechnical University
Guansong Pang, Singapore Management University
Peng Wang, Northwestern Polytechnical University

DOI: https://doi.org/10.1609/aaai.v40i16.38378

Abstract

Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.
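As an illustration of the density-aware frame selection mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes per-frame anomaly scores in [0, 1] are already available and uses inverse-transform sampling over the cumulative score, so that frames are drawn more densely where scores are high. The function name `density_aware_sample` and the `floor` parameter are hypothetical and introduced only for this example.

```python
import numpy as np


def density_aware_sample(scores, num_frames, floor=0.05):
    """Select frame indices, sampling more densely where anomaly scores are high.

    Hypothetical helper: the abstract does not specify the sampler's details.
    `floor` is an assumed constant that keeps low-score regions from being
    ignored entirely.
    """
    scores = np.asarray(scores, dtype=float)
    # Treat (score + floor) as an unnormalized sampling density over frames.
    density = scores + floor
    # Normalized cumulative density, increasing from near 0 to exactly 1.
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    # Evenly spaced quantiles mapped back through the CDF land more often
    # in high-density (high-score) regions.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    idx = np.searchsorted(cdf, quantiles)
    idx = np.clip(idx, 0, len(scores) - 1)
    # Duplicates are merged, so fewer than num_frames indices may be returned.
    return np.unique(idx)


# Usage: 100 frames with an assumed anomalous segment at frames 40-59.
scores = np.zeros(100)
scores[40:60] = 1.0
selected = density_aware_sample(scores, num_frames=8)
```

In this toy setup most of the selected indices fall inside the high-score segment, while the `floor` term still leaves a few samples for surrounding context.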
