Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

Authors

  • Pranath Reddy Kumbam, University of Florida
  • Sohaib Uddin Syed, University of Florida
  • Prashanth Thamminedi, University of Florida
  • Suhas Harish, University of Florida
  • Ian Perera, Institute for Human and Machine Cognition
  • Bonnie J Dorr, University of Florida

DOI:

https://doi.org/10.1609/icwsm.v19i1.35859

Abstract

The advent of social media has given rise to numerous ethical challenges, with hate speech among the most significant concerns. Researchers are attempting to tackle this problem by building hate-speech detection systems that use language models to moderate content automatically and promote civil discourse. Unfortunately, recent studies have revealed that hate-speech detection systems can be misled by adversarial attacks, raising concerns about their resilience. While previous research has separately addressed the robustness of these models under adversarial attacks and their explainability, there has been no comprehensive study exploring their intersection. The novelty of our work lies in combining these two critical aspects, leveraging explainability to identify potential vulnerabilities and enabling the design of targeted adversarial attacks. This paper quantifies the interplay between explainability and adversarial robustness in hate-speech detection models. We define novel metrics based on explainability-driven adversarial attacks to evaluate this relationship, providing a clear assessment of model vulnerabilities and guiding the development of more resilient systems.

Published

2025-06-07

How to Cite

Kumbam, P. R., Syed, S. U., Thamminedi, P., Harish, S., Perera, I., & Dorr, B. J. (2025). Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 1038–1050. https://doi.org/10.1609/icwsm.v19i1.35859