DigimonGPT: An Evolvable Agent with Hierarchical Human-like Memory for Video Question Answering

Borui Li; Xingcai Zhang; Tianen Liu; Shuai Wang; Yun Cheng; Shuai Wang

doi:10.1609/aaai.v40i8.37523

Authors

Borui Li, Southeast University
Xingcai Zhang, Southeast University
Tianen Liu, Southeast University
Shuai Wang, Southeast University
Yun Cheng, ETH Zurich
Shuai Wang, Southeast University

DOI:

https://doi.org/10.1609/aaai.v40i8.37523

Abstract

Video question answering (VideoQA), which aims to produce answers by integrating linguistic and visual understanding, has emerged as a significant research focus. Although Large Multimodal Models (LMMs) and autonomous agent methods have achieved notable advances in VideoQA, their excessive computational overhead and restricted multimodal interaction capabilities limit their ability to support the continuous evolution of a VideoQA system. To address this challenge, we introduce DigimonGPT, an evolvable VideoQA agent inspired by cognitive psychology. Specifically, DigimonGPT integrates a multimodal memory mechanism that lets the VideoQA system evolve continuously. An intra-video declarative memory stores fundamental features of the video and semantic contexts extracted from historical QA pairs. A second, inter-task procedural memory encodes task-solving experience for use in later question answering. Additionally, we introduce a hierarchical memory replay mechanism for VideoQA that selects appropriate memories according to their relevance and the complexity of the question. Extensive experiments demonstrate that DigimonGPT outperforms LMM and autonomous agent baselines in accuracy by an average of 13.71% on the NExT-QA dataset and 9.89% on the Intent-QA dataset.
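
The two memory types and the replay step described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how relevance or question complexity are computed, so the names `MemoryItem`, `relevance`, `complexity`, and `replay`, the token-overlap score, and the "why/how" complexity rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryItem:
    text: str                       # a video fact, past QA context, or solving tip
    kind: str                       # "declarative" (intra-video) or "procedural" (inter-task)
    video_id: Optional[str] = None  # set only for declarative items


def tokens(text: str) -> set:
    """Lowercase whitespace tokens; a stand-in for a real encoder."""
    return set(text.lower().replace("?", " ").split())


def relevance(question: str, item: MemoryItem) -> float:
    """Assumed relevance score: Jaccard overlap between token sets."""
    q, m = tokens(question), tokens(item.text)
    union = q | m
    return len(q & m) / len(union) if union else 0.0


def complexity(question: str) -> str:
    """Assumed heuristic: 'why'/'how' questions count as complex."""
    return "complex" if tokens(question) & {"why", "how"} else "simple"


def replay(question: str, video_id: str, store: List[MemoryItem], k: int = 2) -> List[MemoryItem]:
    """Pick up to k memories: declarative items for this video always,
    procedural items only when the question is judged complex."""
    candidates = [m for m in store
                  if m.kind == "declarative" and m.video_id == video_id]
    if complexity(question) == "complex":
        candidates += [m for m in store if m.kind == "procedural"]
    ranked = sorted(candidates, key=lambda m: relevance(question, m), reverse=True)
    return ranked[:k]


# Toy memory store (made-up entries, not data from the paper).
store = [
    MemoryItem("a boy throws a ball to the dog", "declarative", "v1"),
    MemoryItem("a woman cooks pasta", "declarative", "v2"),
    MemoryItem("for why questions, locate the event before the action", "procedural"),
]

for q in ("what does the boy throw", "why does the boy throw the ball"):
    print(q, "->", [m.kind for m in replay(q, "v1", store)])
```

In this sketch a simple question consults only the memory of the current video, while a complex one also draws on cross-task experience; a real system would presumably replace the token overlap with learned multimodal embeddings.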
