ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning

Zongsheng Cao; Anran Liu; Yangfan He; Jing Li; Bo Zhang; Zigan Wang

doi:10.1609/aaai.v40i1.36963

Authors

Zongsheng Cao: Shanghai Artificial Intelligence Laboratory; Tsinghua University
Anran Liu: Independent Researcher
Yangfan He: Independent Researcher
Jing Li: School of Economics and Management, Tsinghua University
Bo Zhang: Shanghai Artificial Intelligence Laboratory
Zigan Wang: School of Economics and Management, Tsinghua University; Shenzhen International Graduate School, Tsinghua University

Abstract

Retrieval-augmented generation (RAG) has greatly improved Large Language Models (LLMs) by supplying external knowledge. However, current RAG-based methods struggle with long-context video understanding for two main reasons. First, they fail to effectively integrate multimodal and long-range temporal information, resulting in fragmented and context-insensitive knowledge representations. Second, their retrieval mechanisms often rely on static textual matching, which fails to dynamically align user queries with the most relevant video segments and leads to suboptimal downstream performance. To overcome these issues, we introduce ViG-RAG, a new framework that enhances long-context video understanding through structured textual knowledge grounding and multi-modal retrieval. Specifically, we segment video transcripts into structured units, extract key entities, form temporal connections, and assign confidence scores to evidence, enabling coherent long-range reasoning. ViG-RAG thus combines a knowledge-aware grounding mechanism with a context-aware retrieval process that dynamically builds a probabilistic temporal knowledge graph to organize multi-video content. To improve retrieval accuracy, we propose a hybrid retrieval strategy over semantic and temporal features, with an adaptive distribution that models relevance. This yields the optimal retrieval distribution for each query and improves generation efficiency by reducing unnecessary computation. On top of this, ViG-RAG uses a vision-language model to integrate semantic anchors, expanded contextual fields, and selected video frames to generate an accurate response. We evaluate ViG-RAG on several benchmarks, demonstrating that it significantly surpasses current RAG-based methods.
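
The abstract gives no formulas, so the following is only an illustrative sketch of what a hybrid semantic and temporal retrieval step over confidence-weighted transcript segments might look like. Everything in it is an assumption made for illustration, not the authors' method: the `Segment` fields, Jaccard entity overlap standing in for semantic similarity, exponential decay for temporal proximity, the mixing weight `alpha`, and a softmax standing in for the adaptive relevance distribution.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical transcript unit: one node of a temporal knowledge graph."""
    video_id: str
    start: float           # segment start, in seconds
    end: float             # segment end, in seconds
    entities: frozenset    # key entities extracted from the transcript
    confidence: float      # evidence confidence in [0, 1]


def semantic_score(query_entities, seg):
    """Jaccard overlap of entity sets (a stand-in for embedding similarity)."""
    union = query_entities | seg.entities
    if not union:
        return 0.0
    return len(query_entities & seg.entities) / len(union)


def temporal_score(query_time, seg, scale=30.0):
    """Exponential decay with the distance from the query time to the segment."""
    if seg.start <= query_time <= seg.end:
        distance = 0.0
    else:
        distance = min(abs(query_time - seg.start), abs(query_time - seg.end))
    return math.exp(-distance / scale)


def retrieve(segments, query_entities, query_time, alpha=0.6, temperature=0.1, k=2):
    """Return the top-k (segment, probability) pairs under a softmax over hybrid scores."""
    if not segments:
        return []
    raw = [
        seg.confidence
        * (alpha * semantic_score(query_entities, seg)
           + (1.0 - alpha) * temporal_score(query_time, seg))
        for seg in segments
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(raw)
    weights = [math.exp((r - top) / temperature) for r in raw]
    total = sum(weights)
    ranked = sorted(
        zip(segments, (w / total for w in weights)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

In this sketch the evidence confidence simply scales the hybrid score, and a lower `temperature` concentrates the distribution on fewer segments, which is one simple way to limit how much context is passed to the generator. For brevity, the sketch compares timestamps without regard to `video_id`.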
