DigimonGPT: An Evolvable Agent with Hierarchical Human-like Memory for Video Question Answering

Authors

  • Borui Li, Southeast University
  • Xingcai Zhang, Southeast University
  • Tianen Liu, Southeast University
  • Shuai Wang, Southeast University
  • Yun Cheng, ETHZ - ETH Zurich
  • Shuai Wang, Southeast University

DOI:

https://doi.org/10.1609/aaai.v40i8.37523

Abstract

Video question answering (VideoQA), whose goal is to produce answers by integrating linguistic and visual understanding, has emerged as a significant research focus. Although Large Multimodal Models (LMMs) and autonomous agent methods have achieved notable advances in VideoQA, excessive computational overhead and restricted multimodal interaction capabilities limit their ability to support the continuous evolution of a VideoQA system. To address this challenge, we introduce DigimonGPT, an evolvable VideoQA agent inspired by cognitive psychology. Specifically, DigimonGPT integrates a multimodal memory mechanism to achieve the continuous evolution of VideoQA systems. An intra-video declarative memory contains fundamental features of the video and semantic contexts extracted from historical QA pairs. A complementary inter-task procedural memory encodes task-solving experience for later question answering. Additionally, we introduce a hierarchical memory replay mechanism for VideoQA that selects appropriate memories according to their relevance and the complexity of the question. Extensive experiments demonstrate that DigimonGPT outperforms LMM and autonomous agent baselines in accuracy by an average of 13.71% on the NExT-QA dataset and 9.89% on the Intent-QA dataset.
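
The abstract does not give implementation details, so the following is only an illustrative sketch of how a replay step that selects memories by relevance and question complexity might look. Everything in it is an assumption made for illustration and is not taken from the paper: the names `MemoryItem`, `relevance`, `complexity`, and `replay`, the word-overlap relevance score, the "why/how" complexity heuristic, and the rule that only harder questions draw on procedural memory.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    kind: str  # "declarative" (intra-video) or "procedural" (inter-task)


def relevance(question: str, item: MemoryItem) -> float:
    """Toy relevance score: Jaccard overlap of lowercase word sets (assumed)."""
    q = set(question.lower().split())
    m = set(item.text.lower().split())
    return len(q & m) / len(q | m) if q | m else 0.0


def complexity(question: str) -> int:
    """Toy complexity proxy (assumed): 2 for why/how questions, otherwise 1."""
    words = question.lower().split()
    return 2 if words and words[0] in {"why", "how"} else 1


def replay(question: str, memories: list[MemoryItem], base_k: int = 1) -> list[MemoryItem]:
    """Rank memories by relevance; harder questions replay more items and
    may also draw on procedural memory (a hypothetical hierarchy rule)."""
    level = complexity(question)
    allowed = {"declarative"} if level == 1 else {"declarative", "procedural"}
    candidates = [m for m in memories if m.kind in allowed]
    ranked = sorted(candidates, key=lambda m: relevance(question, m), reverse=True)
    return ranked[: base_k * level]


# Hypothetical memory contents for a single video.
memories = [
    MemoryItem("the boy throws the ball to the dog", "declarative"),
    MemoryItem("a woman is cooking in the kitchen", "declarative"),
    MemoryItem("for why questions about the boy inspect frames before the action", "procedural"),
]

for q in ("what does the boy throw", "why does the boy throw the ball"):
    print(q, "->", [m.kind for m in replay(q, memories)])
```

In this sketch a simple descriptive question replays a single declarative memory, while a causal question replays more items and is allowed to reuse stored task-solving experience. A real system would presumably replace the word-overlap score with learned multimodal embeddings.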

Published

2026-03-14

How to Cite

Li, B., Zhang, X., Liu, T., Wang, S., Cheng, Y., & Wang, S. (2026). DigimonGPT: An Evolvable Agent with Hierarchical Human-like Memory for Video Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6001–6009. https://doi.org/10.1609/aaai.v40i8.37523

Section

AAAI Technical Track on Computer Vision V