ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning

Authors

  • Zongsheng Cao (Shanghai Artificial Intelligence Laboratory; Tsinghua University)
  • Anran Liu (Independent Researcher)
  • Yangfan He (Independent Researcher)
  • Jing Li (School of Economics and Management, Tsinghua University)
  • Bo Zhang (Shanghai Artificial Intelligence Laboratory)
  • Zigan Wang (School of Economics and Management, Tsinghua University; Shenzhen International Graduate School, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v40i1.36963

Abstract

Retrieval-augmented generation (RAG) has greatly improved Large Language Models (LLMs) by adding external knowledge. However, current RAG-based methods face two main challenges in long-context video understanding. First, they struggle to integrate multimodal and long-range temporal information effectively, resulting in fragmented and context-insensitive knowledge representations. Second, their retrieval mechanisms often rely on static textual matching, which fails to dynamically align user queries with the most relevant video segments and leads to suboptimal downstream performance. To overcome these issues, we introduce ViG-RAG, a new framework that enhances long-context video understanding through structured textual knowledge grounding and multimodal retrieval. Specifically, we segment video transcripts into structured units, extract key entities, form temporal connections, and assign confidence to evidence, enabling coherent long-range reasoning. To this end, ViG-RAG uses a knowledge-aware grounding mechanism and a context-aware retrieval process that dynamically builds a probabilistic temporal knowledge graph to organize multi-video content. To improve retrieval accuracy, we propose a hybrid retrieval strategy over semantic and temporal features, with an adaptive distribution that models relevance. This yields the optimal retrieval distribution for each query and improves generation efficiency by reducing unnecessary computation. On top of this, ViG-RAG uses a vision-language model to integrate semantic anchors, expanded contextual fields, and selected video frames to generate an accurate response. We evaluate ViG-RAG on several benchmarks, demonstrating that it significantly surpasses current RAG-based methods.
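
The hybrid retrieval idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Segment` type, the bag-of-words similarity, the exponential temporal decay, and the parameters `alpha`, `tau`, and `temperature` are all assumptions chosen for illustration. It only shows one plausible way to blend a semantic score with a temporal score, weight it by evidence confidence, and normalize the result into a per-query retrieval distribution.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript unit: text, start time in seconds, evidence confidence."""
    text: str
    start: float
    confidence: float  # assumed to lie in [0, 1]


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for a learned embedding model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def retrieval_distribution(query, segments, query_time=None,
                           alpha=0.7, tau=30.0, temperature=0.1):
    """Return a probability for each segment (illustrative scoring only).

    alpha       -- weight of the semantic score versus the temporal score
    tau         -- decay scale (seconds) for temporal proximity
    temperature -- softmax temperature; lower values sharpen the distribution
    """
    if not segments:
        return []
    scores = []
    for seg in segments:
        semantic = cosine(query, seg.text)
        # Temporal score is only used when the query carries a time hint.
        temporal = (math.exp(-abs(seg.start - query_time) / tau)
                    if query_time is not None else 0.0)
        scores.append(seg.confidence * (alpha * semantic + (1 - alpha) * temporal))
    # Numerically stable softmax over the blended scores.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


segments = [
    Segment("the chef chops onions", 10.0, 0.9),
    Segment("the chef adds salt to the soup", 60.0, 0.8),
    Segment("guests arrive at the door", 120.0, 0.9),
]
dist = retrieval_distribution("when does the chef add salt", segments)
best = segments[dist.index(max(dist))]
print(best.text)
```

In this sketch the query has no time hint, so ranking is driven by the semantic score and the confidence weight alone; passing `query_time` shifts probability mass toward segments near that timestamp.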

Published

2026-03-14

How to Cite

Cao, Z., Liu, A., He, Y., Li, J., Zhang, B., & Wang, Z. (2026). ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(1), 48–56. https://doi.org/10.1609/aaai.v40i1.36963

Section

AAAI Technical Track on Application Domains I