Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park; Jungyang Park; Dongju Jang; Jiwan Chung; Byungwoo Yoo; Jaewoo Shin; Seonjoon Park; Taehyeong Kim; Youngjae Yu

doi:10.1609/aaai.v40i38.40544

Authors

Jaewoo Park, Yonsei University
Jungyang Park, Yonsei University; Mathpresso
Dongju Jang, Yonsei University
Jiwan Chung, Yonsei University
Byungwoo Yoo, Mathpresso
Jaewoo Shin, Mathpresso
Seonjoon Park, Mathpresso
Taehyeong Kim, Mathpresso
Youngjae Yu, Seoul National University

DOI:

https://doi.org/10.1609/aaai.v40i38.40544

Abstract

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs’ ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.
