Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
DOI: https://doi.org/10.1609/aaai.v40i3.37203
Abstract
Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, yet their safety mechanisms remain susceptible to adversarial exploitation of cognitive biases (systematic deviations from rational judgment). Unlike prior studies focusing on isolated biases, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. Specifically, we propose CognitiveAttack, a novel red-teaming framework that adaptively selects optimal ensembles from 154 cognitive biases defined in human social psychology and engineers them into adversarial prompts that effectively compromise LLM safety mechanisms. Experimental results reveal systemic vulnerabilities across 30 mainstream LLMs, particularly open-source variants. CognitiveAttack achieves a substantially higher attack success rate than the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defenses. Through quantitative analysis of successful jailbreaks, we further identify vulnerability patterns in safety-aligned LLMs under synergistic cognitive biases, validating multi-bias interactions as a potent yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
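To make the multi-bias idea concrete, the toy Python sketch below composes several bias framings around a placeholder request and enumerates candidate ensembles. The bias inventory, the templates, and the names BIAS_TEMPLATES, compose_prompt, and enumerate_ensembles are illustrative assumptions, not the authors' implementation; the paper's adaptive step, scoring each candidate against a target model to pick the optimal ensemble, is only indicated in a comment.

# Hypothetical sketch in the spirit of CognitiveAttack; all names and
# templates here are illustrative stand-ins, not the paper's code.
from itertools import combinations

# Toy subset of a bias inventory (the paper draws on 154 biases); each
# entry maps a bias name to a framing template wrapped around a request.
BIAS_TEMPLATES = {
    "authority_bias": "A senior domain expert has already approved this: {req}",
    "anchoring": "Earlier answers to similar questions were detailed, so: {req}",
    "scarcity": "This is the only chance to obtain this information: {req}",
}

def compose_prompt(request: str, biases: tuple[str, ...]) -> str:
    """Nest the selected bias framings around the request, innermost first."""
    prompt = request
    for bias in biases:
        prompt = BIAS_TEMPLATES[bias].format(req=prompt)
    return prompt

def enumerate_ensembles(request: str, max_size: int = 2):
    """Yield candidate multi-bias prompts. An adaptive attacker would score
    each candidate against the target model and keep the best ensemble;
    that scoring loop is omitted here."""
    for k in range(1, max_size + 1):
        for combo in combinations(BIAS_TEMPLATES, k):
            yield combo, compose_prompt(request, combo)

for combo, prompt in enumerate_ensembles("benign placeholder request"):
    print(combo, "->", prompt)

Running the sketch prints each single- and two-bias framing of the placeholder request, showing how bias framings stack multiplicatively rather than being applied one at a time.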
Published
2026-03-14
How to Cite
Yang, X., Zhou, B., Tang, X., Han, J., & Hu, S. (2026). Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2200-2208. https://doi.org/10.1609/aaai.v40i3.37203
Issue
Vol. 40 No. 3 (2026)
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems