From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs
DOI:
https://doi.org/10.1609/aaai.v40i26.39300
Abstract
Specializing Large Language Models for educational domains is a key frontier in creating personalized learning tools. The central challenge is not data scarcity but data abundance: efficiently selecting a curated subset from vast corpora to enhance specialized skills and foster generalization without degrading existing abilities. Existing data selection paradigms, which rely on superficial semantic similarity or model training dynamics, often lack a principled framework for identifying data that promotes true cognitive growth. Our work proposes a paradigm shift from leveraging such indirect proxies of learning value towards a framework that performs direct, cognitive-level modeling of the learner's state. We introduce CASS, a novel framework that implements this cognitive approach through a clear pipeline, moving from an initial diagnosis to the ultimate goal of expanding the model's cognitive frontier. First, CASS diagnoses the LLM's cognitive frontier using Multidimensional Item Response Theory. Leveraging this diagnosis, it then employs Fisher Information to select a data subset situated at the LLM's cognitive frontier that offers maximum informational gain. Finally, the model is fine-tuned on this curated data using a structured, easy-to-hard curriculum to ensure effective learning. Experiments on our new multi-subject dataset show that models trained with CASS not only achieve superior accuracy in the target domain but also exhibit enhanced generalization. CASS provides a more efficient, effective, and theoretically grounded paradigm for building expert educational LLMs.
Published
2026-03-14
How to Cite
Guo, Y., Zhuang, Y., Liu, Q., Huang, Z., Wang, X., He, L., … Wang, S. (2026). From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 21522–21530. https://doi.org/10.1609/aaai.v40i26.39300
Issue
Section
AAAI Technical Track on Machine Learning III
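
The selection and ordering steps named in the abstract (Fisher Information under Multidimensional Item Response Theory, followed by an easy-to-hard curriculum) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a compensatory two-parameter MIRT model with already-estimated item parameters, scores each item by the trace of its Fisher information matrix, keeps the top k, and orders them by predicted success probability. The names theta, a, d, k and all function names are hypothetical.

```python
import numpy as np

def success_prob(theta, a, d):
    """Assumed compensatory M2PL model: P(correct) = sigmoid(a . theta + d)."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

def fisher_trace(theta, a, d):
    """Trace of each item's Fisher information matrix P(1-P) * a a^T at theta."""
    p = success_prob(theta, a, d)
    return p * (1.0 - p) * np.sum(a * a, axis=1)

def select_and_order(theta, a, d, k):
    """Keep the k most informative items, then order them easy to hard.

    Top-k by trace and ordering by predicted success are illustrative
    assumptions, not details taken from the paper.
    """
    info = fisher_trace(theta, a, d)
    top = np.argsort(-info)[:k]          # highest information first
    p = success_prob(theta, a, d)
    return top[np.argsort(-p[top])]      # highest predicted success first

# Toy example: one learner, four items, two ability dimensions.
theta = np.array([0.0, 0.0])                 # hypothetical diagnosed ability
a = np.array([[1.0, 0.0]] * 4)               # hypothetical discriminations
d = np.array([4.0, 0.5, -0.5, -4.0])         # hypothetical intercepts
chosen = select_and_order(theta, a, d, k=2)
print(chosen)
```

In this toy setup the very easy and very hard items carry little information at the diagnosed ability, so the two items with success probability closest to 0.5 are the ones retained.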