ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset

Hossam Elsafty; Bouthaina Abdou; Tobias Deußer; Maren Pielka; Christian Bauckhage; Rafet Sifa

doi:10.1609/icwsm.v19i1.35944

Authors

Hossam Elsafty, University of Bonn
Bouthaina Abdou, University of Bonn
Tobias Deußer, University of Bonn; Fraunhofer IAIS
Maren Pielka, University of Bonn; Fraunhofer IAIS
Christian Bauckhage, University of Bonn; Fraunhofer IAIS
Rafet Sifa, University of Bonn; Fraunhofer IAIS

DOI:

https://doi.org/10.1609/icwsm.v19i1.35944

Abstract

Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge with a novel approach to data collection that relies primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Given the lack of research on Arabic models, we also present a comprehensive study of Arabic dialect identification using the ArDia dataset.

Published

2025-06-07

How to Cite

Elsafty, H., Abdou, B., Deußer, T., Pielka, M., Bauckhage, C., & Sifa, R. (2025). ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2413–2422. https://doi.org/10.1609/icwsm.v19i1.35944

Issue

Vol. 19 (2025): Proceedings of the Nineteenth International AAAI Conference on Web and Social Media

Section

Dataset Papers
