Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models

Yanrui Du; Sendong Zhao; Ming Ma; Yuhan Chen; Bing Qin

doi:10.1609/aaai.v39i22.34554

Authors

Yanrui Du, Harbin Institute of Technology
Sendong Zhao, Harbin Institute of Technology
Ming Ma, Harbin Institute of Technology
Yuhan Chen, Harbin Institute of Technology
Bing Qin, Harbin Institute of Technology


Abstract

Large Language Models (LLMs) have gained significant attention for their exceptional performance across various domains. Despite these advances, concerns persist about their implicit bias, which often leads to negative social impacts. It is therefore essential to identify implicit bias in LLMs and to investigate the potential threat it poses. Our study focused on a specific type of implicit bias, termed the “Yes-No” implicit bias, which refers to LLMs' inherent tendency to favor “Yes” or “No” responses to a single instruction. By comparing the probability of LLMs generating a series of “Yes” versus “No” responses, we observed that LLMs exhibit different inherent response tendencies when faced with different instructions. To further investigate the impact of this bias, we developed an attack method called Implicit Bias In-Context Manipulation, which attempts to manipulate LLMs' behavior. Specifically, we explored whether the “Yes” implicit bias could turn “No” responses into “Yes” responses when LLMs answer malicious instructions, leading to harmful outputs. Our findings revealed that the “Yes” implicit bias poses a significant security threat, comparable to that of carefully designed attack methods. Moreover, we offered a comprehensive analysis from multiple perspectives to deepen the understanding of this security threat, emphasizing the need for ongoing improvement in LLMs' security.
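The abstract describes comparing the probability of a series of “Yes” responses against a series of “No” responses. The following is a minimal sketch of one way such a comparison could be scored, not the authors' implementation. It assumes the log-probabilities of each candidate response have already been obtained from a model. The function name `yes_preference`, the aggregation by summed probability, and the example numbers are all hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def yes_preference(yes_logprobs, no_logprobs):
    """Share of probability mass on "Yes"-style responses.

    Assumption (not from the paper): each list holds the log-probability
    of one candidate response, and the two families are compared by
    their total probability. Returns a value in (0, 1); values above
    0.5 indicate a "Yes" tendency, values below 0.5 a "No" tendency.
    """
    log_yes = logsumexp(yes_logprobs)
    log_no = logsumexp(no_logprobs)
    # P_yes / (P_yes + P_no), computed as a sigmoid of the log-odds.
    return 1.0 / (1.0 + math.exp(log_no - log_yes))


# Hypothetical numbers: two "Yes"-style responses with probabilities
# 0.3 and 0.1, and one "No"-style response with probability 0.1.
score = yes_preference(
    [math.log(0.3), math.log(0.1)],
    [math.log(0.1)],
)
print(f"yes preference: {score:.2f}")
```

Working in log space avoids underflow when response probabilities are very small, which is typical for multi-token responses.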
