Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)
DOI:
https://doi.org/10.1609/aaai.v39i28.35248

Abstract
To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.

Published
2025-04-11
How to Cite
Dubey, H., & Pack, C. (2025). Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29351-29352. https://doi.org/10.1609/aaai.v39i28.35248
Issue
Section
AAAI Student Abstract and Poster Program