NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Authors

  • Sahil Shah The University of Texas at Austin, Texas, USA
  • S P Sharan The University of Texas at Austin, Texas, USA
  • Harsh Goel The University of Texas at Austin, Texas, USA
  • Minkyu Choi The University of Texas at Austin, Texas, USA
  • Mustafa Munir The University of Texas at Austin, Texas, USA
  • Manvik Pasula Independent Researcher, USA
  • Radu Marculescu The University of Texas at Austin, Texas, USA
  • Sandeep Chinchali The University of Texas at Austin, Texas, USA

DOI:

https://doi.org/10.1609/aaai.v40i11.37834

Abstract

While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video's frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning.
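The segment-selection idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes each frame has already been reduced to boolean event labels (in the paper this role is played by a vision-language model), and it replaces the automaton and model checker with a hand-written scan for one fixed specification: "`first` occurs, and `then` occurs at some strictly later frame". The names `find_segments`, `first`, and `then` are hypothetical.

```python
from typing import Dict, List, Tuple

# One frame = a mapping from event name to whether it was detected.
# In a real pipeline these labels would come from a detector; here they are
# hard-coded assumptions for illustration only.
Frame = Dict[str, bool]


def find_segments(frames: List[Frame], first: str, then: str) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) frame spans, inclusive, where
    `first` holds at `start` and `then` holds at the earliest later frame `end`."""
    segments: List[Tuple[int, int]] = []
    n = len(frames)
    i = 0
    while i < n:
        if frames[i].get(first, False):
            # Look for the earliest strictly later frame where `then` holds.
            end = next(
                (k for k in range(i + 1, n) if frames[k].get(then, False)),
                None,
            )
            if end is None:
                break  # no later occurrence of `then`; nothing more can match
            segments.append((i, end))
            i = end + 1  # continue after this segment to avoid overlaps
        else:
            i += 1
    return segments


if __name__ == "__main__":
    video = [
        {},                      # 0: nothing detected
        {"door_opens": True},    # 1
        {},                      # 2
        {"person_enters": True}, # 3
        {"person_enters": True}, # 4
        {"door_opens": True},    # 5
        {"person_enters": True}, # 6
    ]
    print(find_segments(video, "door_opens", "person_enters"))
```

Only the frames inside the returned spans would then be passed to the question-answering model, rather than a uniform sample of the whole video.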

Published

2026-03-14

How to Cite

Shah, S., Sharan, S. P., Goel, H., Choi, M., Munir, M., Pasula, M., … Chinchali, S. (2026). NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8805–8813. https://doi.org/10.1609/aaai.v40i11.37834

Section

AAAI Technical Track on Computer Vision VIII