LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection
DOI:
https://doi.org/10.1609/aaai.v40i40.40684
Abstract
Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, using them indiscriminately for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to achieve both goals simultaneously. In this paper, we introduce LAMDAS (LLM as an implicit classifier for domain-specific Data Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency among all evaluated baselines.
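To make the one-class framing concrete, here is a minimal, generic sketch of selecting candidates that "belong" to a domain defined by a small reference set. This is not the paper's LAMDAS procedure: the vectors below are toy stand-ins for LLM-derived features, and the centroid-plus-threshold rule is only an illustrative one-class classifier chosen for simplicity.

```python
# Hypothetical sketch of one-class data selection: keep candidates whose
# distance to the reference-domain centroid is within a threshold set from
# the reference data itself. NOT the paper's actual method; the 2-D vectors
# stand in for LLM-derived representations of texts.
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_in_domain(reference, candidates, quantile=0.95):
    """Keep candidate indices no farther from the reference centroid than
    the given quantile of reference distances (a simple one-class rule)."""
    c = centroid(reference)
    ref_d = sorted(dist(v, c) for v in reference)
    threshold = ref_d[min(len(ref_d) - 1, int(quantile * len(ref_d)))]
    return [i for i, v in enumerate(candidates) if dist(v, c) <= threshold]

# Toy "embeddings": reference texts cluster near [1, 0]; one candidate is
# clearly out of domain and should be filtered out.
reference = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
candidates = [[1.0, 0.05], [5.0, 5.0], [0.95, 0.0]]
print(select_in_domain(reference, candidates))  # → [0, 2]
```

The out-of-domain candidate at `[5.0, 5.0]` is rejected because its distance to the reference centroid far exceeds the threshold learned from the reference set alone, mirroring the one-class setting in which only positive (in-domain) examples are available.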
Published
2026-03-14
How to Cite
Wu, J., Yu, H., Liu, B., Wenjie, Y., Di, P., Li, J., & Zhang, Y. (2026). LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33917–33925. https://doi.org/10.1609/aaai.v40i40.40684
Section
AAAI Technical Track on Natural Language Processing V