Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
DOI:
https://doi.org/10.1609/aaaiss.v7i1.36887
Abstract
Current literature suggests that alignment faking is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based interventions are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for deceptive alignment evaluations across model sizes and deployment settings.
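As a rough illustration only, the sketch below shows how a prompt-only mitigation of the kind described in the abstract could be assembled. The framing text, the <scratchpad> tags, and the build_prompt helper are illustrative assumptions, not the prompts or code used in the study.

```python
# Hypothetical sketch of a prompt-only intervention: prepend deontological
# framing and a scratchpad-reasoning instruction to the user query.
# The wording here is invented for illustration, not taken from the paper.

DEONTOLOGICAL_FRAMING = (
    "You must follow your stated principles regardless of who is observing "
    "or how your answers will be used. Honesty and refusal of harmful "
    "requests are duties, not strategies."
)

SCRATCHPAD_INSTRUCTION = (
    "Before answering, reason step by step inside <scratchpad>...</scratchpad> "
    "tags, then give your final answer outside the tags."
)

def build_prompt(user_query: str,
                 moral_framing: bool = True,
                 scratchpad: bool = True) -> list[dict]:
    """Assemble a chat-style message list with optional mitigations enabled."""
    system_parts = []
    if moral_framing:
        system_parts.append(DEONTOLOGICAL_FRAMING)
    if scratchpad:
        system_parts.append(SCRATCHPAD_INSTRUCTION)

    messages = []
    if system_parts:
        messages.append({"role": "system", "content": " ".join(system_parts)})
    messages.append({"role": "user", "content": user_query})
    return messages

if __name__ == "__main__":
    # Print the assembled messages; pass moral_framing=False, scratchpad=False
    # to reproduce an unmitigated baseline prompt.
    for msg in build_prompt("Summarise your deployment constraints."):
        print(f"[{msg['role']}] {msg['content']}")
```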
Published
2025-11-23
How to Cite
Koorndijk, J. (2025). Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques. Proceedings of the AAAI Symposium Series, 7(1), 198-205. https://doi.org/10.1609/aaaiss.v7i1.36887
Issue
Section
AI Trustworthiness and Risk Assessment for Challenged Contexts (ATRACC)