Sahar Dataset: a Validated Dialogue Based Dataset For a Child-Centric, Empathetic and Knowledge-Driven Chatbot

Hadi Al Khansa; Ahmad Mustapha; Mariette Awad

doi:10.1609/aaaiss.v6i1.36049

Authors

Hadi Al Khansa, American University of Beirut
Ahmad Mustapha, American University of Beirut
Mariette Awad, American University of Beirut

Abstract

Artificial intelligence, particularly large language models (LLMs), has had a significant impact in many fields, including chatbots and virtual assistants. With the popularity of ChatGPT, human-AI collaboration through LLM-based chatbots is growing and reaching an ever-expanding audience. One group that requires special attention is children. A chatbot designed for children should be both knowledgeable and empathetic. Although chatbots are essentially fine-tuned versions of LLMs, fine-tuning a model for this purpose is challenging because there are no readily available datasets that cover both scientific queries and empathetic situations. This data shortage can be addressed by using generative AI techniques to create synthetic dataset samples. We therefore propose using ChatGPT prompting to generate the Sahar Dataset, a multi-turn, student-centric chatbot interaction dataset that supports both STEAM-related and empathetic dialogues. Our results show that the Sahar Dataset is readable by 5th grade students according to the Flesch-Kincaid Grade score, while other popular datasets such as Alpaca require a 9th grade reading level. Moreover, we obtained Institutional Review Board (IRB) approval for human evaluations, and the results show that 90 percent of the dataset's STEAM content is factual and that the empathetic dialogues lead to valid solutions to the child's problem 90 percent of the time.
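
The readability claim above rests on the Flesch-Kincaid Grade score, which maps average sentence length and average syllables per word to a U.S. school grade. The sketch below shows how such a score can be computed. It is an illustration only, not the evaluation code used for the Sahar Dataset: the function names, the vowel-group syllable heuristic, and the two sample sentences are all assumptions made for this example, and dedicated readability libraries count syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting runs of vowels (a rough heuristic, assumed here)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent "e" (but not "-le") usually does not add a syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical sample replies, not taken from the dataset.
simple_reply = "Plants need sun and water to grow."
dense_reply = "Photosynthesis converts electromagnetic radiation into chemical energy."

print(f"simple reply grade: {flesch_kincaid_grade(simple_reply):.2f}")
print(f"dense reply grade:  {flesch_kincaid_grade(dense_reply):.2f}")
```

Shorter sentences and shorter words both lower the score, so a dataset aimed at children should score lower than one written for a general audience.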