Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis

Samir Abdaljalil; Shaimaa Hassanein; Hamdy Mubarak; Ahmed Abdelali

doi:10.1609/icwsm.v17i1.22204

Authors

Samir Abdaljalil, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Shaimaa Hassanein, Zewail City of Science and Technology, Giza, Egypt
Hamdy Mubarak, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Ahmed Abdelali, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

DOI:

https://doi.org/10.1609/icwsm.v17i1.22204

Keywords:

Subjectivity in textual data; sentiment analysis; polarity/opinion identification and extraction, linguistic analyses of social media behavior, Web and Social Media

Abstract

The abundance of social media data in the Arab world, particularly on Twitter, has enabled companies and other entities to mine this data for important information, including people's sentiments and opinions towards a topic or a product. However, this abundance brings the challenge of producing models that deliver consistent results when tested in varied contexts. Although model generalization has been thoroughly investigated in many fields, it has not been heavily investigated in the Arabic context. To address this gap, we investigate the generalization of models and data in Arabic, with application to sentiment analysis, by running a battery of experiments and building different models that are tested on five independent test sets to understand how they perform on unseen data. In doing so, we detail different techniques that improve the generalization of machine learning models in Arabic sentiment analysis, and share a large, versatile dataset of approximately 1.64M Arabic tweets and their corresponding sentiment labels for use in future research. Our experiments show that the most consistent model is the one trained on a dataset labelled by a cascade of two models, one that labels neutral tweets and another that identifies positive/negative tweets based on the Arabic emoji lexicon after class balancing. The BERT and SVM models trained on the refined data achieve average F1 scores of 0.62 and 0.60, with standard deviations of 0.06 and 0.04 respectively, when evaluated on five diverse test sets, outperforming other models by at least 17% relative gain in F1. Based on our experiments, we share recommendations to improve model generalization for classification tasks.
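
The cascaded labelling idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the tiny `EMOJI_POLARITY` table stands in for the Arabic emoji lexicon, `neutral_model` is a hypothetical callable standing in for the first-stage neutral classifier, the majority vote over emoji is one plausible reading of lexicon-based labelling, and class balancing is shown as simple downsampling to the smallest class.

```python
import random
from collections import Counter

# Hypothetical toy lexicon; the paper's Arabic emoji lexicon is not reproduced here.
EMOJI_POLARITY = {"😍": "positive", "😊": "positive", "😡": "negative", "😢": "negative"}


def emoji_polarity(tweet):
    """Second stage (assumed): majority vote over lexicon emoji in the tweet.

    Returns "positive", "negative", or None when there is no emoji or a tie.
    """
    votes = Counter(EMOJI_POLARITY[ch] for ch in tweet if ch in EMOJI_POLARITY)
    if votes["positive"] == votes["negative"]:
        return None
    return "positive" if votes["positive"] > votes["negative"] else "negative"


def cascade_label(tweets, neutral_model):
    """Label tweets with a two-stage cascade.

    neutral_model: hypothetical callable returning True for neutral tweets
    (a stand-in for a trained first-stage classifier).
    Tweets that neither stage can label are dropped.
    """
    labelled = []
    for tweet in tweets:
        if neutral_model(tweet):
            labelled.append((tweet, "neutral"))
            continue
        polarity = emoji_polarity(tweet)
        if polarity is not None:
            labelled.append((tweet, polarity))
    return labelled


def balance(labelled, seed=0):
    """Class balancing (assumed): downsample every class to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for tweet, label in labelled:
        by_class.setdefault(label, []).append((tweet, label))
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, smallest))
    return balanced
```

A trained classifier would normally play the role of `neutral_model`; passing it in as a callable keeps the sketch independent of any particular model or library. The balanced output of such a pipeline would then serve as training data for a downstream classifier such as the BERT and SVM models mentioned above.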
