How (Un)ethical Are Instruction-Centric Responses of LLMs? Unveiling the Vulnerabilities of Safety Guardrails to Harmful Queries

Somnath Banerjee; Sayan Layek; Rima Hazra; Animesh Mukherjee

doi:10.1609/icwsm.v19i1.35811

Authors

Somnath Banerjee Indian Institute of Technology Kharagpur, India
Sayan Layek Indian Institute of Technology Kharagpur, India
Rima Hazra Singapore University of Technology and Design, Singapore
Animesh Mukherjee Indian Institute of Technology Kharagpur, India

Abstract

In this study, we address a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work focuses on a specific issue: to what extent can LLMs be led astray by asking them to generate instruction-centric responses, such as pseudocode, a program, or a software snippet, as opposed to vanilla text? To investigate this question, we introduce TechHazardQA, a dataset of complex queries that should be answered in both text and instruction-centric formats (e.g., pseudocode), aimed at identifying triggers for unethical responses. We query a series of LLMs (Llama-2-13b, Llama-2-7b, Mistral-V2, and Mistral 8X7B) and ask them to generate both text and instruction-centric responses. For evaluation, we report the harmfulness score metric as well as judgements from GPT-4 and humans. Overall, we observe that asking LLMs to produce instruction-centric responses increases unethical response generation by 2-38% across the models. As an additional objective, we investigate the impact of model editing using the ROME technique, which further increases the propensity for generating undesirable content. In particular, the propensity to generate unethical content through instruction-centric responses, compared with text responses, increases significantly with a single edit: from an average of 18.9% to 56.7% in zero-shot scenarios, from 31.9% to 56.6% in zero-shot CoT, and from 22.8% to 65.7% in few-shot scenarios.
