Building Domain-Specific Small Language Models via Guided Data Generation

Aman Kumar; Ekant Muljibhai Amin; Xian Yeow Lee; Lasitha Vidyaratne; Ahmed K. Farahat; Dipanjan D. Ghosh; Yuta Koreeda; Chetan Gupta

doi:10.1609/aaai.v40i47.41467

Authors

Aman Kumar, Hitachi America, Ltd. R&D
Ekant Muljibhai Amin, Hitachi, Ltd.
Xian Yeow Lee, Hitachi America, Ltd. R&D
Lasitha Vidyaratne, Hitachi America, Ltd. R&D
Ahmed K. Farahat, Hitachi America, Ltd. R&D
Dipanjan D. Ghosh, Hitachi America, Ltd. R&D
Yuta Koreeda, Hitachi, Ltd.
Chetan Gupta, Hitachi America, Ltd. R&D

DOI:

https://doi.org/10.1609/aaai.v40i47.41467

Abstract

Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
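
As a rough illustration of the final stage of the pipeline described above, the following sketch computes the standard per-pair Direct Preference Optimization loss from sequence log-probabilities. It is a minimal sketch of the generally published DPO objective, not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the abstract does not state which hyperparameters or training library were used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the policy being trained or under the
    frozen reference model. `beta` is an illustrative default, not a
    value taken from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids
    # overflow for large positive or negative margins.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred response relative to the reference.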
