Sahar Dataset: a Validated Dialogue Based Dataset For a Child-Centric, Empathetic and Knowledge-Driven Chatbot
DOI:
https://doi.org/10.1609/aaaiss.v6i1.36049
Abstract
Artificial intelligence, particularly large language models (LLMs), has had a significant impact in many fields, including chatbots and virtual assistants. With the popularity of ChatGPT, human-AI collaboration through LLM-based chatbots is growing and reaching an ever-expanding audience. One group that requires special attention is children. A chatbot designed for children should be both knowledgeable and empathetic. While chatbots are essentially fine-tuned versions of LLMs, fine-tuning these models for this specific purpose is challenging because few readily available datasets address both scientific queries and empathetic situations. This data shortage can be addressed by using generative AI techniques to create synthetic dataset samples. In this paper, we therefore propose using ChatGPT prompting to generate the Sahar Dataset, a multi-turn, student-centric chatbot interaction dataset that supports both STEAM-related and empathetic dialogues. Our results show that the Sahar Dataset is readable by 5th grade students according to the Flesch-Kincaid Grade score, while other popular datasets such as Alpaca require a 9th grade reading level. Moreover, we obtained IRB approval for human evaluations, and the results show that 90 percent of the dataset's STEAM content is factual and that the empathetic dialogues lead to valid solutions to the child's problem 90 percent of the time.
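For readers unfamiliar with the readability measure cited in the abstract, the following is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The abstract does not say which implementation the authors used, so the function names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the syllable counter is a crude vowel-group heuristic assumed here for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (assumed heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as "apple".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid Grade Level of an English text."""
    # Sentences are split on runs of '.', '!' or '?'; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = (
        "Artificial intelligence significantly influences "
        "contemporary educational methodologies."
    )
    print(flesch_kincaid_grade(simple))
    print(flesch_kincaid_grade(dense))
```

Short sentences built from short words score lower, which is the sense in which a dataset can be described as readable at a 5th grade rather than a 9th grade level. A maintained readability library would likely give more reliable syllable counts than this heuristic.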
Published
2025-08-01
How to Cite
Al Khansa, H., Mustapha, A., & Awad, M. (2025). Sahar Dataset: a Validated Dialogue Based Dataset For a Child-Centric, Empathetic and Knowledge-Driven Chatbot. Proceedings of the AAAI Symposium Series, 6(1), 159–166. https://doi.org/10.1609/aaaiss.v6i1.36049
Issue
Section
Human-AI Collaboration: Exploring Diversity of Human Cognitive Abilities and Varied AI Models for Hybrid Intelligent Systems