FuzzE: Fuzzy Fairness Evaluation of Offensive Language Classifiers on African-American English
Hate speech and offensive language are rampant on social media. Machine learning has provided a way to moderate foul language at scale. However, much of the current research focuses on overall performance. Models may perform poorly on text written in a minority dialectal language. For instance, a hate speech classifier may produce more false positives on tweets written in African-American Vernacular English (AAVE). To measure these problems, we need text written in both AAVE and Standard American English (SAE). Unfortunately, it is challenging to curate data for all linguistic styles in a timely manner—especially when we are constrained to specific problems, social media platforms, or by limited resources. In this paper, we answer the question, “How can we evaluate the performance of classifiers across minority dialectal languages when they are not present within a particular dataset?” Specifically, we propose an automated fairness fuzzing tool called FuzzE to quantify the fairness of text classifiers applied to AAVE text using a dataset that only contains text written in SAE. Overall, we find that the fairness estimates returned by our technique moderately correlates with the use of real ground-truth AAVE text. Warning: Offensive language is displayed in this manuscript.