K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education

Authors

  • Yuqing Ye, Northeast Normal University, Changchun, Jilin, China
  • Xuan Zhou, Northeast Normal University, Changchun, Jilin, China
  • Zhifu Chen, Northeast Normal University, Changchun, Jilin, China
  • Dandan Li, Northeast Normal University, Changchun, Jilin, China
  • Hengnian Gu, Northeast Normal University, Changchun, Jilin, China
  • Jin Peng Zhou, Cornell University, Ithaca, New York, United States
  • Dongdai Zhou, Northeast Normal University, Changchun, Jilin, China

DOI:

https://doi.org/10.1609/aaai.v40i40.40744

Abstract

Large language models hold great promise for transforming K-12 education, but there is an urgent need for systematic evaluation of their core educational capabilities. Existing benchmarks often overlook educational goal cognition and overemphasize answer accuracy, thereby failing to capture deeper subject-level knowledge ability and problem-solving ability. To address this gap, we introduce K-12EduBench: a benchmark for evaluating LLMs’ subject-level knowledge ability, subject-specific problem-solving ability, and educational goal cognition ability in K-12 education. K-12EduBench comprises four components: (1) a dataset of 2,640 objective and 619 subjective questions across nine subjects, annotated with answers, problem-solving processes, and cognitive-level labels; (2) nine Item Response Theory (IRT) models for estimating subject-level knowledge ability; (3) evaluation methods and metrics for assessing multi-step problem-solving ability; and (4) prompts and scoring rubrics for measuring alignment with target cognitive levels. Experiments on advanced LLMs show that education-optimized models consistently outperform general-purpose ones across all three abilities, while under-scaled models lag substantially. We observe a strong positive correlation between subject-level knowledge ability and subject-specific problem-solving ability. Despite gains in educational goal cognition ability, current models—even those tailored for education—still fall short of real-world instructional needs.
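
For readers unfamiliar with Item Response Theory, the idea of estimating a latent ability from graded answers can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model with already-calibrated items; the abstract does not state which IRT variant or estimation procedure the benchmark uses, so the model choice, the grid-search estimator, the function names, and the item parameters below are all illustrative assumptions rather than the authors' method.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a respondent with ability theta answers correctly.

    a is the item's discrimination, b its difficulty (both assumed known).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a binary response pattern at ability theta."""
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (discrimination, difficulty) pairs; not taken from the paper.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]

# 1 = correct, 0 = incorrect, one entry per item.
strong = estimate_ability(items, [1, 1, 1, 0])
weak = estimate_ability(items, [1, 0, 0, 0])
print(f"strong respondent: {strong:.2f}, weak respondent: {weak:.2f}")
```

Under this assumed setup, a model that answers more of the discriminating items correctly receives a higher ability estimate, which is the sense in which an IRT score captures more than raw accuracy: it weights each answer by how difficult and how informative the item is.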

Published

2026-03-14

How to Cite

Ye, Y., Zhou, X., Chen, Z., Li, D., Gu, H., Zhou, J. P., & Zhou, D. (2026). K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34459–34466. https://doi.org/10.1609/aaai.v40i40.40744

Section

AAAI Technical Track on Natural Language Processing V