Improving Zero-Shot Cross-Lingual Hate Speech Detection with Pseudo-Label Fine-Tuning of Transformer Language Models

Authors

  • Haris Bin Zia, Queen Mary University of London
  • Ignacio Castro, Queen Mary University of London
  • Arkaitz Zubiaga, Queen Mary University of London
  • Gareth Tyson, Queen Mary University of London

Keywords:

  • Text categorization; topic recognition; demographic/gender/age identification
  • Subjectivity in textual data; sentiment analysis; polarity/opinion identification and extraction
  • Linguistic analyses of social media behavior

Abstract

Hate speech has proliferated on social media platforms in recent years. Although it has been the focus of many studies, most work has concentrated exclusively on a single language, generally English. Low-resource languages have been neglected due to the dearth of labeled resources. These languages, however, account for an important portion of the data because social media is multilingual. This work presents a novel zero-shot, cross-lingual transfer learning pipeline based on pseudo-label fine-tuning of Transformer Language Models for automatic hate speech detection. We apply our pipeline to benchmark datasets covering English (source) and six non-English (target) languages written in three different scripts. Our pipeline achieves an average improvement of 7.6% (in terms of macro-F1) over previous zero-shot, cross-lingual models. This demonstrates the feasibility of high-accuracy automatic hate speech detection for low-resource languages. We release our code and models at https://github.com/harisbinzia/ZeroshotCrosslingualHateSpeech.
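
The abstract reports results in terms of macro-F1. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition (the unweighted mean of per-class F1 scores). It is not taken from the authors' released code; the function name `macro_f1` and the toy labels (1 = hateful, 0 = not hateful) are assumptions made purely for illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (standard macro-F1).

    Illustrative helper only; not part of the paper's released code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); every label here occurs at least once,
        # so the denominator is positive, but guard it anyway.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Each class counts equally, regardless of how many examples it has.
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels: 1 = hateful, 0 = not hateful.
gold = [1, 1, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 1]
print(macro_f1(gold, pred))  # 0.625 (class 1: F1 = 0.5, class 0: F1 = 0.75)
```

Because every class contributes equally, macro-F1 penalizes a model that does well on the majority class but poorly on the minority class, which is why it is a common choice for imbalanced tasks such as hate speech detection.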

Published

2022-05-31

How to Cite

Zia, H. B., Castro, I., Zubiaga, A., & Tyson, G. (2022). Improving Zero-Shot Cross-Lingual Hate Speech Detection with Pseudo-Label Fine-Tuning of Transformer Language Models. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1435-1439. Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/19402