STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

Authors

  • Keishi Ishihara (Turing Inc.)
  • Kento Sasaki (Turing Inc.; University of Tsukuba)
  • Tsubasa Takahashi (Turing Inc.)
  • Daiki Shiono (Turing Inc.; Tohoku University)
  • Yu Yamaguchi (Turing Inc.)

DOI

https://doi.org/10.1609/aaai.v40i7.37441

Abstract

Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 M QA pairs over 270 K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, with near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

Published

2026-03-14

How to Cite

Ishihara, K., Sasaki, K., Takahashi, T., Shiono, D., & Yamaguchi, Y. (2026). STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5257–5266. https://doi.org/10.1609/aaai.v40i7.37441

Issue

Vol. 40 No. 7 (2026)

Section

AAAI Technical Track on Computer Vision IV