MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

Authors

  • Pengfei Zhou, National University of Singapore; Shanghai Innovation Institute
  • Xiaopeng Peng, Rochester Institute of Technology
  • Fanrui Zhang, University of Science and Technology of China; Shanghai Innovation Institute
  • Zhaopan Xu, Harbin Institute of Technology; Shanghai AI Laboratory
  • Jiaxin Ai, Wuhan University; Shanghai Innovation Institute
  • Yansheng Qiu, Wuhan University; Shanghai AI Laboratory
  • Wangbo Zhao, National University of Singapore
  • Jiajun Song, Renmin University of China
  • Chuanhao Li, Shanghai AI Laboratory
  • Weidong Tang, Xi'an University of Electronic Science and Technology
  • Zhen Li, Shanghai AI Laboratory
  • Haoquan Zhang, Shanghai AI Laboratory
  • Zizhen Li, Shanghai Innovation Institute
  • Xiaofeng Mao, Shanghai AI Laboratory
  • Yukang Feng, Shanghai Innovation Institute
  • Jianwen Sun, Shanghai Innovation Institute
  • Kai Wang, National University of Singapore
  • Xiaojun Chang, University of Science and Technology of China
  • Wenqi Shao, Shanghai AI Laboratory
  • Yang You, National University of Singapore
  • Kaipeng Zhang, Shanghai AI Laboratory; Shanghai Innovation Institute

DOI:

https://doi.org/10.1609/aaai.v40i34.40134

Abstract

Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K–12 exams spanning six disciplines, with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation of MLLM performance along four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question-form shifts to challenge model generalization, while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in reasoning. Key findings reveal limitations of current MLLMs in multiple aspects and provide guidance for enhancing model reasoning, robustness, and AI-assisted education.
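The KP-RAG setting described in the abstract can be pictured with a short sketch. The snippet below is a hypothetical illustration, not the authors' released code: KnowledgePoint, retrieve_kps, build_kp_rag_prompt, and the toy word-overlap retriever are all invented here, standing in for whatever retrieval and taxonomy access the paper actually uses. It shows only the general pattern: retrieve relevant knowledge points from the taxonomy and prepend them as references to the exam question before querying an MLLM.

```python
"""Minimal sketch of knowledge-point reference-augmented generation (KP-RAG).
All names in this file are hypothetical illustrations, not the benchmark's API."""

from dataclasses import dataclass


@dataclass
class KnowledgePoint:
    """One node from the benchmark's six-layer knowledge taxonomy (assumed shape)."""
    path: tuple[str, ...]  # e.g. ("Physics", ..., "Projectile motion")
    summary: str           # short reference text for the knowledge point


def retrieve_kps(question: str, kp_index: list[KnowledgePoint],
                 k: int = 3) -> list[KnowledgePoint]:
    """Toy retriever: rank knowledge points by word overlap with the question.
    A real system would use a dense retriever or the benchmark's annotations."""
    q_words = set(question.lower().split())
    scored = sorted(
        kp_index,
        key=lambda kp: len(q_words & set(kp.summary.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_kp_rag_prompt(question: str, kps: list[KnowledgePoint]) -> str:
    """Prepend the retrieved knowledge-point references to the exam question."""
    refs = "\n".join(f"- {' > '.join(kp.path)}: {kp.summary}" for kp in kps)
    return f"Reference knowledge points:\n{refs}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    index = [
        KnowledgePoint(("Physics", "Mechanics", "Kinematics", "Projectile motion"),
                       "A projectile's horizontal and vertical motions are independent."),
        KnowledgePoint(("Math", "Algebra", "Functions", "Quadratics"),
                       "A quadratic function has a parabolic graph with one extremum."),
    ]
    q = "A ball is launched horizontally; describe its projectile motion."
    print(build_kp_rag_prompt(q, retrieve_kps(q, index)))
```

Comparing model answers with and without the prepended references is one way to probe the abstract's question of how much structured knowledge contributes to reasoning.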

Published

2026-03-14

How to Cite

Zhou, P., Peng, X., Zhang, F., Xu, Z., Ai, J., Qiu, Y., Zhao, W., Song, J., Li, C., Tang, W., Li, Z., Zhang, H., Li, Z., Mao, X., Feng, Y., Sun, J., Wang, K., Chang, X., Shao, W., You, Y., & Zhang, K. (2026). MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28982-28990. https://doi.org/10.1609/aaai.v40i34.40134

Issue

Vol. 40 No. 34 (2026)

Section

AAAI Technical Track on Machine Learning XI