From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
DOI:
https://doi.org/10.1609/aaai.v40i39.40578
Abstract
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically rich training data for developing more effective AI math tutors.
Published
2026-03-14
How to Cite
Shi, W., Ren, H., Pan, J., Zhou, A., Wang, K., Lu, Z., Yang, Y., Hu, Y., Wei, L., Zhan, M., & Li, H. (2026). From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32965-32973. https://doi.org/10.1609/aaai.v40i39.40578
Section
AAAI Technical Track on Natural Language Processing IV