TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos
DOI:
https://doi.org/10.1609/aaai.v40i16.38378
Abstract
Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.
Published
2026-03-14
How to Cite
Zhou, L., Wu, P., Zhang, M., Wang, Q., Pang, G., & Wang, P. (2026). TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13710–13718. https://doi.org/10.1609/aaai.v40i16.38378
Issue
Section
AAAI Technical Track on Computer Vision XIII
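Illustrative sketch
The abstract describes an Anomaly-focused Temporal Sampler that selects frames "via a density-aware strategy guided by predicted anomaly scores" but gives no algorithmic details. The Python sketch below shows one plausible reading of that idea, not the authors' implementation: frames are chosen at evenly spaced quantiles of a cumulative density built from per-frame anomaly scores, so more frames are drawn where scores are high. The function name `density_aware_sample`, the `floor` parameter, and the inverse-CDF scheme itself are assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, k, floor=0.1):
    """Pick up to k frame indices, denser where anomaly scores are higher.

    Hypothetical helper (not from the paper). `floor` adds a uniform
    density so that low-score regions still receive some coverage.
    """
    if not scores or k <= 0:
        return []
    n = len(scores)
    k = min(k, n)
    # Per-frame sampling density: clipped score plus a uniform floor.
    density = [max(s, 0.0) + floor for s in scores]
    cdf = list(accumulate(density))
    total = cdf[-1]
    picked = []
    for j in range(k):
        # Evenly spaced quantiles of the cumulative density
        # (deterministic inverse-CDF sampling).
        target = (j + 0.5) / k * total
        idx = bisect_left(cdf, target)
        picked.append(min(idx, n - 1))
    # Remove duplicates and keep temporal order; this may return
    # fewer than k indices when several quantiles hit the same frame.
    return sorted(set(picked))


# Example: 25 frames with a burst of high scores in frames 10-14.
example_scores = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
example_frames = density_aware_sample(example_scores, k=8)
```

In this example most of the selected indices fall inside the high-score burst, while the uniform floor keeps a few frames from the surrounding context. A stochastic variant, or one that uses the paper's actual score predictor, could differ substantially.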