VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Authors

  • Shi-Xue Zhang, University of Science and Technology Beijing, China; Tencent Technology (Shenzhen) Co. Ltd, China
  • Hongfa Wang, Tsinghua Shenzhen International Graduate School, Tsinghua University, China; Tencent Technology (Shenzhen) Co. Ltd, China
  • Duojun Huang, Tencent Technology (Shenzhen) Co. Ltd, China
  • Xin Li, Tencent Technology (Shenzhen) Co. Ltd, China
  • Xiaobin Zhu, University of Science and Technology Beijing, China
  • Xu-Cheng Yin, University of Science and Technology Beijing, China

DOI

https://doi.org/10.1609/aaai.v40i15.38269

Abstract

Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer (QA) pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), together with an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive analysis of QA pairs. Our benchmark can advance the development of robust text-to-video models by providing actionable insights for caption optimization.
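
The abstract names the three metrics but does not give their formulas. As a rough illustration only, the sketch below assumes that an LLM judge labels each QA pair, given a caption, as "correct", "incorrect", or "unanswerable", and that AR, IR, and CR are the fractions of correct, incorrect, and answerable pairs, respectively. The label names, the function caption_metrics, and these definitions are all assumptions made for illustration, not the paper's stated method.

```python
from collections import Counter

# Hypothetical verdict labels; the abstract does not specify them.
CORRECT = "correct"
INCORRECT = "incorrect"
UNANSWERABLE = "unanswerable"
VALID_LABELS = {CORRECT, INCORRECT, UNANSWERABLE}


def caption_metrics(verdicts):
    """Compute AR, IR, and CR from per-question verdicts for one caption.

    Assumed definitions (not taken from the paper):
      AR = correct / total
      IR = incorrect / total
      CR = (correct + incorrect) / total, i.e. the share of questions
           the caption contains enough detail to answer at all.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = len(verdicts)
    return {
        "AR": counts[CORRECT] / total,
        "IR": counts[INCORRECT] / total,
        "CR": (counts[CORRECT] + counts[INCORRECT]) / total,
    }


# Example: four questions about one caption.
example = caption_metrics([CORRECT, CORRECT, INCORRECT, UNANSWERABLE])
print(example)
```

Under these assumed definitions, AR and IR sum to CR, so a caption can score poorly either by stating wrong details (high IR) or by omitting details (low CR).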

Published

2026-03-14

How to Cite

Zhang, S.-X., Wang, H., Huang, D., Li, X., Zhu, X., & Yin, X.-C. (2026). VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12726–12734. https://doi.org/10.1609/aaai.v40i15.38269

Issue

Section

AAAI Technical Track on Computer Vision XII