ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset
DOI:
https://doi.org/10.1609/icwsm.v19i1.35944
Abstract
Although Arabic is one of the most widely spoken languages, dialectal Arabic data is scarce. In this paper, we address this challenge by proposing a novel approach to data collection that draws primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Because research on Arabic models is limited, we also present a comprehensive study of Arabic dialect identification using the ArDia dataset.
Published
2025-06-07
How to Cite
Elsafty, H., Abdou, B., Deußer, T., Pielka, M., Bauckhage, C., & Sifa, R. (2025). ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2413–2422. https://doi.org/10.1609/icwsm.v19i1.35944
Section
Dataset Papers