TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos

Authors

  • Lingru Zhou, Northwestern Polytechnical University
  • Peng Wu, Northwestern Polytechnical University
  • Manqing Zhang, Northwestern Polytechnical University
  • Qingsheng Wang, Northwestern Polytechnical University
  • Guansong Pang, Singapore Management University
  • Peng Wang, Northwestern Polytechnical University

DOI:

https://doi.org/10.1609/aaai.v40i16.38378

Abstract

Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.
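
The abstract does not specify how the Anomaly-focused Temporal Sampler turns predicted anomaly scores into a frame selection. As a rough illustration of what a density-aware strategy could look like, the sketch below treats per-frame scores as an unnormalized density and picks frames at evenly spaced quantiles of its cumulative distribution, so high-score segments receive more frames. The function name `density_aware_sample`, the `floor` parameter, and the quantile scheme are all assumptions made for illustration, not the authors' implementation.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, k, floor=0.05):
    """Pick up to k frame indices, denser where anomaly scores are high.

    scores: per-frame anomaly scores, assumed to lie in [0, 1] (hypothetical input).
    k:      number of quantile positions to probe.
    floor:  constant added to every score so low-score segments keep some coverage
            (an assumption of this sketch, not a detail from the paper).
    """
    if k <= 0 or not scores:
        return []

    # Treat the floored scores as an unnormalized density over frames.
    weights = [max(s, 0.0) + floor for s in scores]
    total = sum(weights)
    cdf = [c / total for c in accumulate(weights)]

    # Map evenly spaced quantile midpoints through the inverse CDF.
    picks = set()
    for i in range(k):
        q = (i + 0.5) / k
        idx = min(bisect_left(cdf, q), len(scores) - 1)
        picks.add(idx)

    # Duplicates collapse, so fewer than k indices may be returned.
    return sorted(picks)


# Example: a 25-frame clip whose middle five frames score high.
example_scores = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
example_picks = density_aware_sample(example_scores, k=8)
```

In this example most of the selected indices fall inside the high-score window (frames 10 to 14), while the floor still leaves a few frames from the surrounding context. A learned or score-thresholded sampler would be an equally plausible reading of the abstract.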

Published

2026-03-14

How to Cite

Zhou, L., Wu, P., Zhang, M., Wang, Q., Pang, G., & Wang, P. (2026). TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13710–13718. https://doi.org/10.1609/aaai.v40i16.38378

Section

AAAI Technical Track on Computer Vision XIII