VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding

Authors

  • Shibo Gao — Beijing Jiaotong University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Peipei Yang — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Yangyang Liu — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Yi Chen — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Han Zhu — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Xu-Yao Zhang — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Linlin Huang — Beijing Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i6.42412

Abstract

For video anomaly detection, it is important both to detect when an anomalous event occurs and to understand what the event is. The tasks of temporal grounding and semantic understanding can benefit from joint learning, but no existing work supports it. To address this problem, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark designed to jointly evaluate semantic understanding and precise temporal grounding of anomalies, with comprehensive annotations and objective multiple-choice Video QA. In addition, we propose Glance then Scrutinize (GtS), the first training-free framework to achieve the best balanced performance in both accuracy and efficiency. GtS uniquely combines high temporal precision with semantic interpretability while meeting practical speed requirements, outperforming previous methods in real-world scenarios. Furthermore, we introduce the JeAUG metric for holistic evaluation of both speed and accuracy. Extensive experiments demonstrate the effectiveness and practicality of our benchmark, framework, and metric.

Published

2026-03-14

How to Cite

Gao, S., Yang, P., Liu, Y., Chen, Y., Zhu, H., Zhang, X.-Y., & Huang, L. (2026). VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4167–4175. https://doi.org/10.1609/aaai.v40i6.42412

Section

AAAI Technical Track on Computer Vision III