Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)

Authors

  • Harsh Dubey, South Dakota State University
  • Chulwoo Pack, South Dakota State University

DOI:

https://doi.org/10.1609/aaai.v39i28.35248

Abstract

To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.
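The VAD-Score described above reports precision, recall, and F1 for events, subjects, and objects. As a minimal sketch of how such a metric could be computed, the snippet below scores each category with set-based matching; note that exact string matching is an assumption here, since the abstract does not specify how predicted and reference items are aligned (the actual metric may use semantic matching).

```python
# Hypothetical VAD-Score-style computation: set-based precision/recall/F1
# per category (events, subjects, objects). Exact-match alignment is an
# assumption; the paper's actual matching strategy may differ.

def prf1(predicted, reference):
    """Return (precision, recall, F1) for two lists of items."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # items present in both prediction and reference
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative example with made-up descriptions for one video clip.
pred = {"events": ["opens door", "sits down"],
        "subjects": ["man"],
        "objects": ["door", "chair", "cup"]}
ref = {"events": ["opens door", "sits down", "drinks coffee"],
       "subjects": ["man", "woman"],
       "objects": ["door", "cup"]}

scores = {category: prf1(pred[category], ref[category]) for category in pred}
```

A per-video summary could then average the three F1 values, though how the paper aggregates across categories and videos is not stated in the abstract.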

Published

2025-04-11

How to Cite

Dubey, H., & Pack, C. (2025). Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29351-29352. https://doi.org/10.1609/aaai.v39i28.35248