VisAssist: A Visually Impaired-Captured Video Question Answering Benchmark for Assistive Systems

Authors

  • Qi Gao, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Heng Li, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Yixin Zhou, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Meixuan Zhou, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Jieqiong Chen, Department of Ophthalmology, Shanghai General Hospital, China
  • Xinyu Chai, School of Biomedical Engineering, Shanghai Jiao Tong University, China

DOI:

https://doi.org/10.1609/aaai.v40i6.42410

Abstract

We present VisAssist, the first large-scale video question-answering dataset comprising 13,413 real-world videos captured by visually impaired users, addressing a critical gap in assistive vision research. Unlike existing benchmarks that rely on third-person footage, VisAssist provides authentic first-person perspectives that capture the distinctive challenges of blind photography, including unconventional framing, motion artifacts, and frequent information omission. Benchmark evaluations of state-of-the-art multimodal models reveal systematic limitations: severe deficiencies in spatial reasoning when processing dynamic first-person viewpoints, an inability to distinguish genuinely missing information from poor capture quality, which leads to hazardous hallucinations, and fragile text understanding, especially for non-Latin scripts under suboptimal conditions. This work establishes a vital real-world benchmark and underscores the need for specialized architectures in visual assistance systems.

Published

2026-03-14

How to Cite

Gao, Q., Li, H., Zhou, Y., Zhou, M., Chen, J., & Chai, X. (2026). VisAssist: A Visually Impaired-Captured Video Question Answering Benchmark for Assistive Systems. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4149–4157. https://doi.org/10.1609/aaai.v40i6.42410

Section

AAAI Technical Track on Computer Vision III