VisAssist: A Visually Impaired-Captured Video Question Answering Benchmark for Assistive Systems

Authors

  • Qi Gao, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Heng Li, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Yixin Zhou, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Meixuan Zhou, School of Biomedical Engineering, Shanghai Jiao Tong University, China
  • Jieqiong Chen, Department of Ophthalmology, Shanghai General Hospital, China
  • Xinyu Chai, School of Biomedical Engineering, Shanghai Jiao Tong University, China

DOI:

https://doi.org/10.1609/aaai.v40i6.42410

Abstract

We present VisAssist, the first large-scale video question-answering dataset comprising 13,413 real-world videos captured by visually impaired users, addressing a critical gap in assistive vision research. Unlike existing benchmarks that rely on third-person footage, VisAssist provides authentic first-person perspectives that capture the distinctive challenges of blind photography, including unconventional framing, motion artifacts, and frequent information omission. Benchmark evaluations of state-of-the-art multimodal models reveal systematic limitations: severe deficiencies in spatial reasoning when processing dynamic first-person viewpoints, an inability to distinguish genuinely missing information from poor capture quality, which leads to hazardous hallucinations, and fragile text understanding, especially for non-Latin scripts under suboptimal conditions. This work establishes a vital real-world benchmark and underscores the need for specialized architectures in visual assistance systems.

Published

2026-03-14

How to Cite

Gao, Q., Li, H., Zhou, Y., Zhou, M., Chen, J., & Chai, X. (2026). VisAssist: A Visually Impaired-Captured Video Question Answering Benchmark for Assistive Systems. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4149–4157. https://doi.org/10.1609/aaai.v40i6.42410

Section

AAAI Technical Track on Computer Vision III