NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models

Authors

  • Yiran Ye Pennsylvania State University
  • Thai Le Indiana University
  • Dongwon Lee Pennsylvania State University

DOI:

https://doi.org/10.1609/icwsm.v19i1.35961

Abstract

Online texts with toxic content are a clear threat to users of social media in particular and to society in general. Although many platforms have adopted various measures (e.g., machine learning-based hate-speech detection systems) to diminish their effect, toxic content writers have also attempted to evade such measures by using cleverly modified toxic words, so-called human-written text perturbations. Therefore, to help build automatic detection tools that recognize those perturbations, prior methods have developed sophisticated techniques to generate diverse adversarial samples. However, we note that these "algorithm"-generated perturbations do not necessarily capture all the traits of "human"-written perturbations. Therefore, in this paper, we introduce a novel, high-quality dataset of human-written perturbations, named NoisyHate, created from real-life perturbations that are both written and verified by humans in the loop. We show that perturbations in NoisyHate have different characteristics than those in prior algorithm-generated toxic datasets and thus can be particularly useful for developing better toxic speech detection solutions. We also provide basic benchmarks on the potential utility of NoisyHate in perturbation normalization and understanding tasks. Both the dataset and the source code are publicly available.

Published

2025-06-07

How to Cite

Ye, Y., Le, T., & Lee, D. (2025). NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2603–2612. https://doi.org/10.1609/icwsm.v19i1.35961