Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

Pranath Reddy Kumbam; Sohaib Uddin Syed; Prashanth Thamminedi; Suhas Harish; Ian Perera; Bonnie J Dorr

doi:10.1609/icwsm.v19i1.35859

Authors

Pranath Reddy Kumbam, University of Florida
Sohaib Uddin Syed, University of Florida
Prashanth Thamminedi, University of Florida
Suhas Harish, University of Florida
Ian Perera, Institute for Human and Machine Cognition
Bonnie J Dorr, University of Florida

DOI:

https://doi.org/10.1609/icwsm.v19i1.35859

Abstract

The advent of social media has given rise to numerous ethical challenges, with hate speech among the most significant concerns. Researchers are attempting to tackle this problem by using hate-speech detection and employing language models to automatically moderate content and promote civil discourse. Unfortunately, recent studies have revealed that hate-speech detection systems can be misled by adversarial attacks, raising concerns about their resilience. While previous research has separately addressed the robustness of these models under adversarial attacks and their explainability, there has been no comprehensive study exploring their intersection. The novelty of our work lies in combining these two critical aspects, leveraging explainability to identify potential vulnerabilities and enabling the design of targeted adversarial attacks. This paper quantifies the interplay between explainability and adversarial robustness in hate-speech detection models. We define novel metrics based on explainability-driven adversarial attacks to evaluate this relationship, providing a clear assessment of model vulnerabilities and guiding the development of more resilient systems.

Published

2025-06-07

How to Cite

Kumbam, P. R., Syed, S. U., Thamminedi, P., Harish, S., Perera, I., & Dorr, B. J. (2025). Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 1038–1050. https://doi.org/10.1609/icwsm.v19i1.35859

Issue

Vol. 19 (2025): Proceedings of the Nineteenth International AAAI Conference on Web and Social Media

Section

Full Papers
