From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs

Yuxiang Guo; Yan Zhuang; Qi Liu; Zhenya Huang; Xianquan Wang; Liyang He; Jiatong Li; Rui Li; Shijin Wang

doi:10.1609/aaai.v40i26.39300

Authors

Yuxiang Guo State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Yan Zhuang State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Qi Liu State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Zhenya Huang State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Xianquan Wang State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Liyang He State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Jiatong Li State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Rui Li State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
Shijin Wang IFLYTEK Research

Abstract

Specializing Large Language Models for educational domains is a key frontier in creating personalized learning tools. The central challenge is not data scarcity but its abundance: efficiently selecting a curated data subset from vast corpora to enhance specialized skills and foster generalization, without degrading existing abilities. Existing data selection paradigms, relying on superficial semantic similarity or model training dynamics, often lack a principled framework to identify data that promotes true cognitive growth. Our work proposes a paradigm shift from leveraging indirect proxies of learning value, such as semantic similarity and training dynamics, towards a framework that performs a direct, cognitive-level modeling of the learner's state. We introduce CASS, a novel framework that implements this cognitive approach through a clear pipeline, moving from an initial Diagnosis to the ultimate goal of expanding the model's cognitive frontier. First, CASS diagnoses the LLM's cognitive frontier using Multidimensional Item Response Theory. Leveraging this diagnosis, it then employs Fisher Information to select a data subset situated at the LLM's cognitive frontier that offers maximum informational gain. Finally, the model is fine-tuned on this curated data using a structured, easy-to-hard curriculum to ensure effective learning. Experiments on our new multi-subject dataset show that models trained with CASS not only achieve superior accuracy in the target domain but also exhibit enhanced generalization. CASS provides a more efficient, effective, and theoretically grounded paradigm for building expert educational LLMs.
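
The three stages named in the abstract (diagnosis, Fisher Information selection, easy-to-hard ordering) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the model form or the selection rule, so the code assumes a standard compensatory two-parameter MIRT model, P(correct) = sigmoid(a · theta + d), summarizes each item's Fisher Information matrix P(1 - P) a aᵀ by its trace, and uses predicted success probability as a stand-in for "easy". The names `success_prob`, `fisher_trace`, and `select_and_order` are hypothetical, and the ability vector `theta` and item parameters are assumed to have been estimated beforehand.

```python
import math
from typing import List, Sequence, Tuple

# An item is (a, d): a discrimination vector and an intercept (assumed 2PL MIRT form).
Item = Tuple[Sequence[float], float]


def success_prob(theta: Sequence[float], a: Sequence[float], d: float) -> float:
    """Predicted probability that the learner answers the item correctly."""
    z = sum(t * ai for t, ai in zip(theta, a)) + d
    return 1.0 / (1.0 + math.exp(-z))


def fisher_trace(theta: Sequence[float], a: Sequence[float], d: float) -> float:
    """Trace of the item's Fisher Information matrix P(1-P) * a a^T.

    The trace is one possible scalar summary (an assumption of this sketch);
    it is largest when P is near 0.5, i.e. near the learner's frontier.
    """
    p = success_prob(theta, a, d)
    return p * (1.0 - p) * sum(ai * ai for ai in a)


def select_and_order(theta: Sequence[float], items: List[Item], k: int) -> List[int]:
    """Pick the k most informative items, then order them easy-to-hard."""
    scored = [
        (fisher_trace(theta, a, d), success_prob(theta, a, d), idx)
        for idx, (a, d) in enumerate(items)
    ]
    # Selection step: keep the k items with the highest information.
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    # Curriculum step: highest predicted success (easiest) first.
    return [idx for _, _, idx in sorted(top, key=lambda s: s[1], reverse=True)]


if __name__ == "__main__":
    theta = [0.0, 0.0]  # hypothetical diagnosed ability
    items = [
        ([1.0, 0.0], 3.0),   # far too easy
        ([1.0, 0.0], 0.5),   # slightly easy, near the frontier
        ([1.0, 0.0], -0.5),  # slightly hard, near the frontier
        ([1.0, 0.0], -3.0),  # far too hard
    ]
    print(select_and_order(theta, items, k=2))  # [1, 2]
```

In this toy setting the very easy and very hard items carry little information and are dropped, while the two items near the frontier are kept and presented with the easier one first. A real pipeline would also need the MIRT fitting step itself, which this sketch takes as given.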
