ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset

Authors

  • Hossam Elsafty, University of Bonn
  • Bouthaina Abdou, University of Bonn
  • Tobias Deußer, University of Bonn; Fraunhofer IAIS
  • Maren Pielka, University of Bonn; Fraunhofer IAIS
  • Christian Bauckhage, University of Bonn; Fraunhofer IAIS
  • Rafet Sifa, University of Bonn; Fraunhofer IAIS

DOI:

https://doi.org/10.1609/icwsm.v19i1.35944

Abstract

Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge by proposing a novel approach to data collection that draws primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Because research on Arabic models is limited, we also present a comprehensive study of Arabic dialect identification using the ArDia dataset.

Published

2025-06-07

How to Cite

Elsafty, H., Abdou, B., Deußer, T., Pielka, M., Bauckhage, C., & Sifa, R. (2025). ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2413–2422. https://doi.org/10.1609/icwsm.v19i1.35944