Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models

Authors

  • Yanrui Du Harbin Institute of Technology
  • Sendong Zhao Harbin Institute of Technology
  • Ming Ma Harbin Institute of Technology
  • Yuhan Chen Harbin Institute of Technology
  • Bing Qin Harbin Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i22.34554

Abstract

Large Language Models (LLMs) have gained significant attention for their exceptional performance across various domains. Despite these advances, concerns persist about their implicit bias, which often leads to negative social impacts. It is therefore essential to identify the implicit bias in LLMs and to investigate the potential threat it poses. Our study focuses on a specific type of implicit bias, termed the “Yes-No” implicit bias, which refers to LLMs' inherent tendency to favor “Yes” or “No” responses to a single instruction. By comparing the probability of LLMs generating a series of “Yes” versus “No” responses, we observed that LLMs exhibit different inherent response tendencies when faced with different instructions. To further investigate the impact of this bias, we developed an attack method called Implicit Bias In-Context Manipulation, which attempts to manipulate LLMs' behavior. Specifically, we explored whether the “Yes” implicit bias could turn “No” responses into “Yes” responses when LLMs answer malicious instructions, leading to harmful outputs. Our findings revealed that the “Yes” implicit bias poses a significant security threat, comparable to that of carefully designed attack methods. Moreover, we offer a comprehensive analysis from multiple perspectives to deepen the understanding of this security threat, emphasizing the need for ongoing improvement in LLMs' security.
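
The probability comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact metric, so everything here is an assumption made for illustration: the function names `log_sum_exp` and `yes_tendency`, the example response sets, and the choice to normalize the total “Yes” probability mass against the total “No” mass. The log-probabilities are supplied by the caller; how they are obtained from a particular model is outside the scope of this sketch.

```python
import math
from typing import Iterable, Mapping, Sequence


def log_sum_exp(values: Iterable[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty iterable."""
    vals = list(values)
    peak = max(vals)
    return peak + math.log(sum(math.exp(v - peak) for v in vals))


def yes_tendency(
    logprobs: Mapping[str, float],
    yes_responses: Sequence[str],
    no_responses: Sequence[str],
) -> float:
    """Hypothetical bias score: P(yes) / (P(yes) + P(no)).

    `logprobs` maps each candidate response to the log-probability a model
    assigns to it for one instruction (assumed to be provided by the caller).
    A score above 0.5 suggests a "Yes" tendency, below 0.5 a "No" tendency.
    """
    yes_lp = log_sum_exp(logprobs[r] for r in yes_responses)
    no_lp = log_sum_exp(logprobs[r] for r in no_responses)
    # Sigmoid of the log-odds gives the normalized "Yes" share.
    return 1.0 / (1.0 + math.exp(no_lp - yes_lp))


# Illustrative, made-up numbers; not results from the paper.
example_logprobs = {
    "Yes": math.log(0.3),
    "Sure": math.log(0.3),
    "No": math.log(0.1),
    "Sorry": math.log(0.1),
}
score = yes_tendency(example_logprobs, ["Yes", "Sure"], ["No", "Sorry"])
```

Working in log space avoids underflow when multi-token responses have very small probabilities. The paper itself may define the comparison differently.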

Published

2025-04-11

How to Cite

Du, Y., Zhao, S., Ma, M., Chen, Y., & Qin, B. (2025). Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23823–23831. https://doi.org/10.1609/aaai.v39i22.34554

Section

AAAI Technical Track on Natural Language Processing I