MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

Authors

  • Pengfei Zhou, National University of Singapore; Shanghai Innovation Institute
  • Xiaopeng Peng, Rochester Institute of Technology
  • Fanrui Zhang, University of Science and Technology of China; Shanghai Innovation Institute
  • Zhaopan Xu, Harbin Institute of Technology; Shanghai AI Laboratory
  • Jiaxin Ai, Wuhan University; Shanghai Innovation Institute
  • Yansheng Qiu, Wuhan University; Shanghai AI Laboratory
  • Wangbo Zhao, National University of Singapore
  • Jiajun Song, Renmin University of China
  • Chuanhao Li, Shanghai AI Laboratory
  • Weidong Tang, Xi'an University of Electronic Science and Technology
  • Zhen Li, Shanghai AI Laboratory
  • Haoquan Zhang, Shanghai AI Laboratory
  • Zizhen Li, Shanghai Innovation Institute
  • Xiaofeng Mao, Shanghai AI Laboratory
  • Yukang Feng, Shanghai Innovation Institute
  • Jianwen Sun, Shanghai Innovation Institute
  • Kai Wang, National University of Singapore
  • Xiaojun Chang, University of Science and Technology of China
  • Wenqi Shao, Shanghai AI Laboratory
  • Yang You, National University of Singapore
  • Kaipeng Zhang, Shanghai AI Laboratory; Shanghai Innovation Institute

DOI:

https://doi.org/10.1609/aaai.v40i34.40134

Abstract

Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K–12 exams spanning six disciplines, with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation of MLLM performance along four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question-form shifts to challenge model generalization, while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in reasoning. Key findings reveal limitations of current MLLMs in multiple aspects and provide guidance for enhancing model reasoning, robustness, and AI-assisted education.
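The KP-RAG setting described in the abstract can be pictured with a short sketch. The snippet below is a hypothetical illustration, not the authors' released code: KnowledgePoint, retrieve_kps, build_kp_rag_prompt, and the toy word-overlap retriever are all invented here, standing in for whatever retrieval and taxonomy access the paper actually uses. It shows only the general pattern: retrieve relevant knowledge points from the taxonomy and prepend them as references to the exam question before querying an MLLM.

```python
"""Minimal sketch of knowledge-point reference-augmented generation (KP-RAG).
All names in this file are hypothetical illustrations, not the benchmark's API."""

from dataclasses import dataclass


@dataclass
class KnowledgePoint:
    """One node from the benchmark's six-layer knowledge taxonomy (assumed shape)."""
    path: tuple[str, ...]  # e.g. ("Physics", ..., "Projectile motion")
    summary: str           # short reference text for the knowledge point


def retrieve_kps(question: str, kp_index: list[KnowledgePoint],
                 k: int = 3) -> list[KnowledgePoint]:
    """Toy retriever: rank knowledge points by word overlap with the question.
    A real system would use a dense retriever or the benchmark's annotations."""
    q_words = set(question.lower().split())
    scored = sorted(
        kp_index,
        key=lambda kp: len(q_words & set(kp.summary.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_kp_rag_prompt(question: str, kps: list[KnowledgePoint]) -> str:
    """Prepend the retrieved knowledge-point references to the exam question."""
    refs = "\n".join(f"- {' > '.join(kp.path)}: {kp.summary}" for kp in kps)
    return f"Reference knowledge points:\n{refs}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    index = [
        KnowledgePoint(("Physics", "Mechanics", "Kinematics", "Projectile motion"),
                       "A projectile's horizontal and vertical motions are independent."),
        KnowledgePoint(("Math", "Algebra", "Functions", "Quadratics"),
                       "A quadratic function has a parabolic graph with one extremum."),
    ]
    q = "A ball is launched horizontally; describe its projectile motion."
    print(build_kp_rag_prompt(q, retrieve_kps(q, index)))
```

Comparing model answers with and without the prepended references is one way to probe the abstract's question of how much structured knowledge contributes to reasoning.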

Published

2026-03-14

How to Cite

Zhou, P., Peng, X., Zhang, F., Xu, Z., Ai, J., Qiu, Y., Zhao, W., Song, J., Li, C., Tang, W., Li, Z., Zhang, H., Li, Z., Mao, X., Feng, Y., Sun, J., Wang, K., Chang, X., Shao, W., You, Y., & Zhang, K. (2026). MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28982-28990. https://doi.org/10.1609/aaai.v40i34.40134

Issue

Vol. 40 No. 34 (2026)

Section

AAAI Technical Track on Machine Learning XI