Scaling-up Perceptual Video Quality Assessment
DOI:
https://doi.org/10.1609/aaai.v40i27.39386
Abstract
The data scaling law has significantly enhanced the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of data scaling remains largely untapped due to the scarcity of labeled resources and the insufficient scale of existing datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, machine-dominated synthetic multi-modal instruction databases (MIDBs) for VQA. We then scale up to create OmniVQA-Chat-400K, the largest dataset in the VQA field to date. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we build the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets built for different tasks. Furthermore, we propose the OmniVQA-FG (fine-grained) Benchmark to evaluate the fine-grained performance of models. Our results demonstrate that our models achieve state-of-the-art performance on both tasks.
Published
2026-03-14
How to Cite
Jia, Z., Zhang, Z., Zhu, X., Li, C., Han, J., Liu, X., Zhai, G., & Min, X. (2026). Scaling-up Perceptual Video Quality Assessment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22292-22300. https://doi.org/10.1609/aaai.v40i27.39386
Issue
Section
AAAI Technical Track on Machine Learning IV