Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Authors

  • Jaewoo Park, Yonsei University
  • Jungyang Park, Yonsei University; Mathpresso
  • Dongju Jang, Yonsei University
  • Jiwan Chung, Yonsei University
  • Byungwoo Yoo, Mathpresso
  • Jaewoo Shin, Mathpresso
  • Seonjoon Park, Mathpresso
  • Taehyeong Kim, Mathpresso
  • Youngjae Yu, Seoul National University

DOI

https://doi.org/10.1609/aaai.v40i38.40544

Abstract

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs’ ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

Published

2026-03-14

How to Cite

Park, J., Park, J., Jang, D., Chung, J., Yoo, B., Shin, J., … Yu, Y. (2026). Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32664–32672. https://doi.org/10.1609/aaai.v40i38.40544

Issue

Section

AAAI Technical Track on Natural Language Processing III