HARK: Hierarchical Agentic Retrieval with Keyframing for Video Understanding (Student Abstract)

Authors

  • Jingcheng Li (University of California, San Diego)
  • Ye Qiao (University of California, Irvine)
  • Sitao Huang (University of California, Irvine)

DOI:

https://doi.org/10.1609/aaai.v40i48.42237

Abstract

Current video understanding models struggle with temporal reasoning and must trade detail preservation against computational cost. We propose HARK, a hierarchical memory system that segments videos into action and scene units, combined with question-aware agentic keyframe selection. Our method achieves 70.3% overall accuracy on the short-video subset of the VideoMME benchmark.
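
To make the idea of a two-level memory with question-aware keyframe selection concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not specify how units are represented or scored, so everything here is an assumption for illustration: the `SceneUnit` and `ActionUnit` classes, the text captions attached to each unit, the word-overlap score (a simple stand-in for the agentic selection step), and the choice of each action's midpoint frame as its keyframe.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionUnit:
    start: int      # first frame index (inclusive)
    end: int        # last frame index (inclusive)
    caption: str    # hypothetical text description of the action


@dataclass
class SceneUnit:
    caption: str    # hypothetical text description of the scene
    actions: list[ActionUnit] = field(default_factory=list)


def overlap(question: str, text: str) -> int:
    """Count distinct lowercase words shared by the question and the text."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def scene_score(scene: SceneUnit, question: str) -> int:
    """Score a scene by its own caption plus the captions of its actions."""
    text = scene.caption + " " + " ".join(a.caption for a in scene.actions)
    return overlap(question, text)


def select_keyframes(scenes: list[SceneUnit], question: str, k: int = 2) -> list[int]:
    """Coarse-to-fine retrieval: pick the best scene, then its top-k actions."""
    if not scenes:
        return []
    # Coarse step: choose the scene most related to the question.
    best = max(scenes, key=lambda s: scene_score(s, question))
    # Fine step: rank that scene's actions and keep the k most related.
    ranked = sorted(best.actions, key=lambda a: overlap(question, a.caption), reverse=True)
    # Assumed keyframe rule: the midpoint frame of each selected action.
    return sorted((a.start + a.end) // 2 for a in ranked[:k])


# Toy memory with two scenes, each split into two actions (made-up data).
scenes = [
    SceneUnit("kitchen", [ActionUnit(0, 40, "person chops onion"),
                          ActionUnit(41, 90, "person stirs pot")]),
    SceneUnit("garden", [ActionUnit(91, 150, "dog chases ball"),
                         ActionUnit(151, 200, "dog drinks water")]),
]
question = "what is the dog doing with the ball in the garden"
print(select_keyframes(scenes, question, k=1))
```

In this toy example only frames from the garden scene are considered, so the downstream model sees a handful of frames instead of the whole video. A real system would presumably replace the word-overlap score with a learned or model-driven relevance judgment.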

Published

2026-03-14

How to Cite

Li, J., Qiao, Y., & Huang, S. (2026). HARK: Hierarchical Agentic Retrieval with Keyframing for Video Understanding (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41266–41268. https://doi.org/10.1609/aaai.v40i48.42237