K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education

Yuqing Ye; Xuan Zhou; Zhifu Chen; Dandan Li; Hengnian Gu; Jin Peng Zhou; Dongdai Zhou

doi:10.1609/aaai.v40i40.40744

Authors

Yuqing Ye, Northeast Normal University, Changchun, Jilin, China
Xuan Zhou, Northeast Normal University, Changchun, Jilin, China
Zhifu Chen, Northeast Normal University, Changchun, Jilin, China
Dandan Li, Northeast Normal University, Changchun, Jilin, China
Hengnian Gu, Northeast Normal University, Changchun, Jilin, China
Jin Peng Zhou, Cornell University, Ithaca, New York, United States
Dongdai Zhou, Northeast Normal University, Changchun, Jilin, China

Abstract

Large language models hold great promise for transforming K-12 education, but there is an urgent need for systematic evaluation of their core educational capabilities. Existing benchmarks often overlook educational goal cognition and overemphasize answer accuracy, thereby failing to capture deeper subject-level knowledge ability and problem-solving ability. To address this gap, we introduce K-12EduBench: a benchmark for evaluating LLMs’ subject-level knowledge ability, subject-specific problem-solving ability, and educational goal cognition ability in K-12 education. K-12EduBench comprises four components: (1) a dataset of 2,640 objective and 619 subjective questions across nine subjects, annotated with answers, problem-solving processes, and cognitive-level labels; (2) nine Item Response Theory (IRT) models for estimating subject-level knowledge ability; (3) evaluation methods and metrics for assessing multi-step problem-solving ability; and (4) prompts and scoring rubrics for measuring alignment with target cognitive levels. Experiments on advanced LLMs show that education-optimized models consistently outperform general-purpose ones across all three abilities, while under-scaled models lag substantially. We observe a strong positive correlation between subject-level knowledge ability and subject-specific problem-solving ability. Despite gains in educational goal cognition ability, current models—even those tailored for education—still fall short of real-world instructional needs.
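To illustrate how an IRT model can turn a pattern of right and wrong answers into an ability estimate, the following is a minimal sketch. It assumes a two-parameter logistic (2PL) model and a simple grid-search maximum-likelihood estimate; the abstract does not state which IRT variant or fitting procedure the benchmark uses, and the item parameters and function names below are hypothetical, chosen only for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function (assumed model, not confirmed by the paper).

    theta: ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given ability."""
    ll = 0.0
    for correct, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        ll += math.log(p) if correct else math.log(1.0 - p)
    return ll


def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical (discrimination, difficulty) pairs; not taken from K-12EduBench.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]

# A model answering three of four items correctly versus only the easiest one.
strong = estimate_ability([1, 1, 1, 0], items)
weak = estimate_ability([1, 0, 0, 0], items)
print(f"strong: {strong:.2f}, weak: {weak:.2f}")
```

Under these assumptions, the response pattern with more correct answers yields a higher estimated ability. Unlike raw accuracy, the estimate also depends on which items were answered correctly, since each item is weighted by its difficulty and discrimination.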
