Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Authors

  • Qirui Chen, School of Artificial Intelligence, Shanghai Jiao Tong University, China; Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China
  • Shangzhe Di, School of Artificial Intelligence, Shanghai Jiao Tong University, China; Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China
  • Weidi Xie, School of Artificial Intelligence, Shanghai Jiao Tong University, China

DOI:

https://doi.org/10.1609/aaai.v39i2.32214

Abstract

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. The task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answer pairs with associated temporal evidence, which enables the construction of a large-scale dataset for instruction tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement. Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multimodal large language models by incorporating a grounding module that retrieves temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person-view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark ActivityNet-RTL, demonstrating its effectiveness.
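To make the task format concrete, the following is a minimal sketch of how a multi-hop sample with several evidence intervals might be represented, together with one plausible way to score predicted intervals against annotated ones. Everything here is an assumption for illustration: the `MultiHopSample` class, its field names, the example question, and the set-level temporal IoU are hypothetical and are not taken from the paper, its dataset schema, or its evaluation protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An interval is (start_sec, end_sec) within the video; hypothetical convention.
Interval = Tuple[float, float]


@dataclass
class MultiHopSample:
    """Hypothetical record: one question, one answer, several evidence spans."""
    question: str
    answer: str
    evidence: List[Interval]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Total duration covered by a set of intervals, counting overlaps once."""
    return sum(end - start for start, end in merge_intervals(intervals))


def multi_interval_iou(pred: List[Interval], gt: List[Interval]) -> float:
    """IoU between two sets of intervals (illustrative metric, an assumption)."""
    pred_m, gt_m = merge_intervals(pred), merge_intervals(gt)
    # After merging, each set is disjoint, so pairwise overlaps sum exactly
    # to the duration of the intersection of the two sets.
    inter = sum(
        max(0.0, min(pe, ge) - max(ps, gs))
        for ps, pe in pred_m
        for gs, ge in gt_m
    )
    union = total_length(pred_m) + total_length(gt_m) - inter
    return inter / union if union > 0 else 0.0


# Invented example, not drawn from the benchmark.
sample = MultiHopSample(
    question="Where did I put the knife after I cut the vegetables?",
    answer="In the sink.",
    evidence=[(5.0, 15.0), (20.0, 25.0)],
)
predicted = [(0.0, 10.0), (20.0, 30.0)]
score = multi_interval_iou(predicted, sample.evidence)
```

In this sketch the predicted and annotated spans share 10 seconds out of a 25-second union, so `score` is 0.4. Scoring at the set level, rather than per interval, rewards a model only when it recovers all scattered pieces of evidence, which is the property that distinguishes multi-hop grounding from single-interval localization.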

Published

2025-04-11

How to Cite

Chen, Q., Di, S., & Xie, W. (2025). Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 2159–2167. https://doi.org/10.1609/aaai.v39i2.32214

Section

AAAI Technical Track on Computer Vision I