CogStream: Context-guided Streaming Video Question Answering

Authors

  • Zicheng Zhao, Shanghai Jiao Tong University
  • Kangyu Wang, Shanghai Jiao Tong University
  • Shijie Li, Shanghai Jiao Tong University
  • Rui Qian, The Chinese University of Hong Kong
  • Weiyao Lin, Shanghai Jiao Tong University
  • Huabin Liu, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i16.38336

Abstract

Although advances in Video Large Language Models (Vid-LLMs) have improved multimodal understanding, streaming video reasoning remains challenging because it relies on contextual information. Existing paradigms feed all available historical context into Vid-LLMs, which imposes a significant computational burden for visual data processing, and the irrelevant context this includes distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios and requires models to identify the most relevant historical contextual information to deduce answers to questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. We also present CogReasoner as a baseline model, which tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method.

Published

2026-03-14

How to Cite

Zhao, Z., Wang, K., Li, S., Qian, R., Lin, W., & Liu, H. (2026). CogStream: Context-guided Streaming Video Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13332–13341. https://doi.org/10.1609/aaai.v40i16.38336

Issue

Vol. 40 No. 16 (2026)

Section

AAAI Technical Track on Computer Vision XIII