Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales

Authors

  • Zahra Delbari, Tehran Institute for Advanced Studies
  • Nafise Sadat Moosavi, Department of Computer Science, University of Sheffield
  • Mohammad Taher Pilehvar, Cardiff University

DOI:

https://doi.org/10.1609/aaai.v38i16.29743

Keywords:

NLP: Other, NLP: Applications

Abstract

With the alarming rise of hate speech in online communities, the demand for effective NLP models to identify instances of offensive language has reached a critical point. However, the development of such models relies heavily on the availability of annotated datasets, which are scarce, particularly for less-studied languages. To bridge this gap for the Persian language, we present a novel dataset specifically tailored to multi-label hate speech detection. Our dataset, called Phate, consists of over seven thousand manually annotated Persian tweets, offering a rich resource for training and evaluating hate speech detection models for this language. Notably, each annotation in our dataset specifies the targeted group of the hate speech and includes a span of the tweet that elucidates the rationale behind the assigned label. Incorporating this information expands the potential applications of our dataset, facilitating the detection of targeted online harm and allowing the benchmark to serve research on the interpretability of hate speech detection models. The dataset, annotation guidelines, and all associated code are accessible at https://github.com/Zahra-D/Phate.

Published

2024-03-24

How to Cite

Delbari, Z., Moosavi, N. S., & Pilehvar, M. T. (2024). Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17889-17897. https://doi.org/10.1609/aaai.v38i16.29743

Section

AAAI Technical Track on Natural Language Processing I