VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding

Authors

  • Shibo Gao — Beijing Jiaotong University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Peipei Yang — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Yangyang Liu — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Yi Chen — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
  • Han Zhu — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Xu-Yao Zhang — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Linlin Huang — Beijing Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i6.42412

Abstract

For video anomaly detection, it is important both to detect when an anomalous event occurs and to understand what the event is. The tasks of temporal grounding and semantic understanding can benefit from joint learning, but no existing work supports it. To address this problem, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark designed to jointly evaluate semantic understanding and precise temporal grounding of anomalies, with comprehensive annotations and objective multiple-choice Video QA. In addition, we propose Glance then Scrutinize (GtS), the first training-free framework to achieve the best balanced performance in both accuracy and efficiency. GtS uniquely combines high temporal precision with semantic interpretability while meeting practical speed requirements, outperforming previous methods in real-world scenarios. Furthermore, we introduce the JeAUG metric for holistic evaluation of both speed and accuracy. Extensive experiments demonstrate the effectiveness and practicality of our benchmark, framework, and metric.

Published

2026-03-14

How to Cite

Gao, S., Yang, P., Liu, Y., Chen, Y., Zhu, H., Zhang, X.-Y., & Huang, L. (2026). VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4167–4175. https://doi.org/10.1609/aaai.v40i6.42412

Section

AAAI Technical Track on Computer Vision III