From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs

Authors

  • Yuxiang Guo, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Yan Zhuang, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Qi Liu, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Zhenya Huang, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Xianquan Wang, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Liyang He, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Jiatong Li, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Rui Li, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
  • Shijin Wang, IFLYTEK Research

DOI:

https://doi.org/10.1609/aaai.v40i26.39300

Abstract

Specializing Large Language Models for educational domains is a key frontier in creating personalized learning tools. The central challenge is not data scarcity but its abundance: efficiently selecting a curated data subset from vast corpora to enhance specialized skills and foster generalization, without degrading existing abilities. Existing data selection paradigms, relying on superficial semantic similarity or model training dynamics, often lack a principled framework to identify data that promotes true cognitive growth. Our work proposes a paradigm shift from leveraging indirect proxies of learning value, such as semantic similarity and training dynamics, towards a framework that performs a direct, cognitive-level modeling of the learner's state. We introduce CASS, a novel framework that implements this cognitive approach through a clear pipeline, moving from an initial Diagnosis to the ultimate goal of expanding the model's cognitive frontier. First, CASS diagnoses the LLM's cognitive frontier using Multidimensional Item Response Theory. Leveraging this diagnosis, it then employs Fisher Information to select a data subset situated at the LLM's cognitive frontier that offers maximum informational gain. Finally, the model is fine-tuned on this curated data using a structured, easy-to-hard curriculum to ensure effective learning. Experiments on our new multi-subject dataset show that models trained with CASS not only achieve superior accuracy in the target domain but also exhibit enhanced generalization. CASS provides a more efficient, effective, and theoretically grounded paradigm for building expert educational LLMs.
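
The three stages named in the abstract (diagnosis, Fisher-information selection, easy-to-hard curriculum) can be illustrated with a minimal sketch. This is not the CASS implementation, whose details the abstract does not give. It assumes a standard compensatory two-parameter multidimensional IRT model, P = sigmoid(a · theta + d), takes the ability vector theta and the item parameters as already estimated, uses the trace of each item's Fisher information matrix as a hypothetical selection score, and treats a higher predicted success probability as "easier". All function names and item values are illustrative.

```python
import math

def p_correct(theta, a, d):
    """Assumed M2PL model: probability of a correct response, sigmoid(a . theta + d)."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))

def fisher_trace(theta, a, d):
    """Trace of the item's Fisher information matrix P(1 - P) * a a^T."""
    p = p_correct(theta, a, d)
    return p * (1.0 - p) * sum(ai * ai for ai in a)

def select_and_order(theta, items, k):
    """Keep the k most informative items, then sort them easy-to-hard."""
    ranked = sorted(
        items,
        key=lambda it: fisher_trace(theta, it["a"], it["d"]),
        reverse=True,
    )
    chosen = ranked[:k]
    # Higher predicted success probability is treated as easier (an assumption).
    return sorted(
        chosen,
        key=lambda it: p_correct(theta, it["a"], it["d"]),
        reverse=True,
    )

# Hypothetical diagnosed ability and item pool (two latent skills).
theta = [0.0, 0.0]
items = [
    {"id": "too_easy", "a": [1.0, 1.0], "d": 6.0},
    {"id": "frontier_easy", "a": [1.0, 1.0], "d": 0.5},
    {"id": "frontier_hard", "a": [1.0, 1.0], "d": -0.5},
    {"id": "too_hard", "a": [1.0, 1.0], "d": -6.0},
]
curriculum = [it["id"] for it in select_and_order(theta, items, k=2)]
print(curriculum)
```

Under this model, items the learner almost always or almost never answers correctly carry little information, because P(1 - P) is close to zero, so the selection concentrates on items near the estimated ability boundary.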

Published

2026-03-14

How to Cite

Guo, Y., Zhuang, Y., Liu, Q., Huang, Z., Wang, X., He, L., … Wang, S. (2026). From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 21522–21530. https://doi.org/10.1609/aaai.v40i26.39300

Section

AAAI Technical Track on Machine Learning III