Building Domain-Specific Small Language Models via Guided Data Generation

Authors

  • Aman Kumar, Hitachi America, Ltd. R&D
  • Ekant Muljibhai Amin, Hitachi, Ltd.
  • Xian Yeow Lee, Hitachi America, Ltd. R&D
  • Lasitha Vidyaratne, Hitachi America, Ltd. R&D
  • Ahmed K. Farahat, Hitachi America, Ltd. R&D
  • Dipanjan D. Ghosh, Hitachi America, Ltd. R&D
  • Yuta Koreeda, Hitachi, Ltd.
  • Chetan Gupta, Hitachi America, Ltd. R&D

DOI:

https://doi.org/10.1609/aaai.v40i47.41467

Abstract

Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
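
The abstract names Direct Preference Optimization (DPO) as the final training stage but does not give its objective. As an illustration only, the sketch below computes the standard DPO loss for a single preference pair from summed token log-probabilities. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and are not taken from the paper, whose actual implementation and hyperparameters are not described here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy prefers "chosen" over "rejected"
    # than the reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))

    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy and the reference agree, the loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Raising the chosen response's log-probability relative to the
# rejected one lowers the loss below that baseline.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In practice the four log-probabilities would come from forward passes of the fine-tuned model and a frozen copy of it over the same prompt and response pair; the scalar version here is only meant to show how the preference margin drives the loss.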

Published

2026-03-14

How to Cite

Kumar, A., Amin, E. M., Lee, X. Y., Vidyaratne, L., Farahat, A. K., Ghosh, D. D., … Gupta, C. (2026). Building Domain-Specific Small Language Models via Guided Data Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(47), 40287–40294. https://doi.org/10.1609/aaai.v40i47.41467

Section

IAAI Technical Track on Emerging Applications of AI