Scaling-up Perceptual Video Quality Assessment
DOI:
https://doi.org/10.1609/aaai.v40i27.39386
Abstract
The data scaling law has significantly enhanced the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of data scaling remains largely untapped due to the scarcity of labeled resources and the insufficient scale of existing datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, machine-dominated synthetic multi-modal instruction databases (MIDBs) for VQA. We then scale up to create OmniVQA-Chat-400K, the largest dataset in the VQA field to date. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we build the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets built for different tasks. Furthermore, we propose the OmniVQA-FG (fine-grained) Benchmark to evaluate the fine-grained performance of models. Our results demonstrate that our models achieve state-of-the-art performance on both tasks.
Published
2026-03-14
How to Cite
Jia, Z., Zhang, Z., Zhu, X., Li, C., Han, J., Liu, X., Zhai, G., & Min, X. (2026). Scaling-up Perceptual Video Quality Assessment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22292-22300. https://doi.org/10.1609/aaai.v40i27.39386
Issue
Section
AAAI Technical Track on Machine Learning IV