RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Adrian de Wynter; Ishaan Watts; Tua Wongsangaroonsri; Minghui Zhang; Noura Farra; Nektar Ege Altıntoprak; Lena Baur; Samantha Claudet; Pavel Gajdušek; Qilong Gu; Anna Kaminska; Tomasz Kaminski; Ruby Kuo; Akiko Kyuba; Jongho Lee; Kartik Mathur; Petter Merok; Ivana Milovanović; Nani Paananen; Vesa-Matti Paananen; Anna Pavlenko; Bruno Pereira Vidal; Luciano Ivan Strika; Yueh Tsao; Davide Turcato; Oleksandr Vakhno; Judit Velcsov; Anna Vickers; Stéphanie F. Visser; Herdyan Widarmanto; Andrey Zaikin; Si-Qing Chen

doi:10.1609/aaai.v39i27.35011

Authors

Adrian de Wynter Microsoft; The University of York
Ishaan Watts Microsoft
Tua Wongsangaroonsri Microsoft
Minghui Zhang Microsoft
Noura Farra Microsoft
Nektar Ege Altıntoprak Microsoft
Lena Baur Microsoft
Samantha Claudet Microsoft
Pavel Gajdušek Microsoft
Qilong Gu Microsoft
Anna Kaminska Microsoft
Tomasz Kaminski Microsoft
Ruby Kuo Microsoft
Akiko Kyuba Microsoft
Jongho Lee Microsoft
Kartik Mathur Microsoft
Petter Merok Microsoft
Ivana Milovanović Microsoft
Nani Paananen Microsoft
Vesa-Matti Paananen Microsoft
Anna Pavlenko Microsoft
Bruno Pereira Vidal Microsoft
Luciano Ivan Strika Microsoft
Yueh Tsao Microsoft
Davide Turcato Microsoft
Oleksandr Vakhno Microsoft
Judit Velcsov Microsoft
Anna Vickers Microsoft
Stéphanie F. Visser Microsoft
Herdyan Widarmanto Microsoft
Andrey Zaikin Microsoft
Si-Qing Chen Microsoft

Abstract

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, yet their safety remains a serious concern. With the advent of multilingual S/LLMs, the question becomes one of scale: can we expand multilingual safety evaluations of these models as quickly as they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when holistically scoring the toxicity of a prompt, and they have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.
