Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques

Authors

  • J. Koorndijk, Seraphion Technology

DOI:

https://doi.org/10.1609/aaaiss.v7i1.36887

Abstract

Current literature suggests that alignment faking is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based interventions are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for deceptive alignment evaluations across model sizes and deployment settings.

Published

2025-11-23

How to Cite

Koorndijk, J. (2025). Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques. Proceedings of the AAAI Symposium Series, 7(1), 198-205. https://doi.org/10.1609/aaaiss.v7i1.36887

Section

AI Trustworthiness and Risk Assessment for Challenged Contexts (ATRACC)