VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
DOI:
https://doi.org/10.1609/aaai.v40i15.38269
Abstract
Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics — Accuracy Rate (AR), Inconsistency Rate (IR), and Coverage Rate (CR) — and an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. Our benchmark can advance the development of robust text-to-video models by providing actionable insights for caption optimization.
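The abstract names three metrics over per-question verification outcomes but does not define them here. The sketch below is a plausible interpretation, assuming each QA pair is judged as `"correct"`, `"inconsistent"`, or `"unanswered"` by the LLM verifier; the label names and exact formulas are assumptions, not the paper's definitions.

```python
# Hypothetical sketch of the three metrics (AR, IR, CR) described in the
# abstract. The outcome labels and formulas are assumptions for illustration.

def accuracy_rate(results):
    """Fraction of QA pairs the caption answers correctly (AR)."""
    return sum(r == "correct" for r in results) / len(results)

def inconsistency_rate(results):
    """Fraction of QA pairs the caption contradicts (IR)."""
    return sum(r == "inconsistent" for r in results) / len(results)

def coverage_rate(results):
    """Fraction of QA pairs the caption addresses at all,
    i.e., not left unanswered (CR)."""
    return sum(r != "unanswered" for r in results) / len(results)

# Example: verifier outcomes for five QA pairs of one video caption.
results = ["correct", "correct", "inconsistent", "unanswered", "correct"]
print(accuracy_rate(results))       # 0.6
print(inconsistency_rate(results))  # 0.2
print(coverage_rate(results))       # 0.8
```

Under these assumed definitions, a strong caption maximizes AR and CR while minimizing IR; the three are complementary, since a caption can cover many dimensions (high CR) yet still contradict the video on several of them (high IR).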
Published
2026-03-14
How to Cite
Zhang, S.-X., Wang, H., Huang, D., Li, X., Zhu, X., & Yin, X.-C. (2026). VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12726–12734. https://doi.org/10.1609/aaai.v40i15.38269
Section
AAAI Technical Track on Computer Vision XII