RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Authors

  • Adrian de Wynter Microsoft, The University of York
  • Ishaan Watts Microsoft
  • Tua Wongsangaroonsri Microsoft
  • Minghui Zhang Microsoft
  • Noura Farra Microsoft
  • Nektar Ege Altıntoprak Microsoft
  • Lena Baur Microsoft
  • Samantha Claudet Microsoft
  • Pavel Gajdušek Microsoft
  • Qilong Gu Microsoft
  • Anna Kaminska Microsoft
  • Tomasz Kaminski Microsoft
  • Ruby Kuo Microsoft
  • Akiko Kyuba Microsoft
  • Jongho Lee Microsoft
  • Kartik Mathur Microsoft
  • Petter Merok Microsoft
  • Ivana Milovanović Microsoft
  • Nani Paananen Microsoft
  • Vesa-Matti Paananen Microsoft
  • Anna Pavlenko Microsoft
  • Bruno Pereira Vidal Microsoft
  • Luciano Ivan Strika Microsoft
  • Yueh Tsao Microsoft
  • Davide Turcato Microsoft
  • Oleksandr Vakhno Microsoft
  • Judit Velcsov Microsoft
  • Anna Vickers Microsoft
  • Stéphanie F. Visser Microsoft
  • Herdyan Widarmanto Microsoft
  • Andrey Zaikin Microsoft
  • Si-Qing Chen Microsoft

DOI:

https://doi.org/10.1609/aaai.v39i27.35011

Abstract

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question becomes one of scale: can we expand multilingual safety evaluations of these models at the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when holistically scoring the toxicity of a prompt, and they have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.

Published

2025-04-11

How to Cite

de Wynter, A., Watts, I., Wongsangaroonsri, T., Zhang, M., Farra, N., Altıntoprak, N. E., … Chen, S.-Q. (2025). RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios? Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 27940–27950. https://doi.org/10.1609/aaai.v39i27.35011