VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
DOI:
https://doi.org/10.1609/aaai.v40i6.42412
Abstract
In video anomaly detection, it is important both to determine when an anomalous event occurs and to understand what the event is. The tasks of temporal grounding and semantic understanding can benefit from joint learning, yet no existing work supports it. To address this gap, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark designed to jointly evaluate semantic understanding and precise temporal grounding of anomalies, featuring comprehensive annotations and objective multiple-choice video QA. We also propose Glance then Scrutinize (GtS), the first training-free framework that achieves the best balance between accuracy and efficiency: GtS uniquely combines high temporal precision with semantic interpretability while meeting practical speed requirements, outperforming previous methods in real-world scenarios. Furthermore, we introduce the JeAUG metric for holistic evaluation of both speed and accuracy. Extensive experiments demonstrate the effectiveness and practicality of our benchmark, framework, and metric.
Published
2026-03-14
How to Cite
Gao, S., Yang, P., Liu, Y., Chen, Y., Zhu, H., Zhang, X.-Y., & Huang, L. (2026). VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4167–4175. https://doi.org/10.1609/aaai.v40i6.42412
Issue
Section
AAAI Technical Track on Computer Vision III