Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Authors

  • Tommaso Tosato: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; CHU Sainte Justine Research Center, Department of Psychiatry, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada; Tara Research
  • Saskia Helbling: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; CHU Sainte Justine Research Center, Department of Psychiatry, Université de Montréal, Montreal, QC, Canada; Ernst Strüngmann Institute (ESI) for Neuroscience, Frankfurt, Germany
  • Yorguin-Jose Mantilla-Ramos: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada; Cognitive and Computational Neuroscience Laboratory (CoCo Lab), Université de Montréal, Montreal, QC, Canada
  • Mahmood Hegazy: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada; LiNARiTE.AI
  • Alberto Tosato: Tara Research
  • David John Lemay: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada
  • Irina Rish: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada
  • Guillaume Dumas: Mila - Quebec AI Institute, Université de Montréal, Montreal, QC, Canada; CHU Sainte Justine Research Center, Department of Psychiatry, Université de Montréal, Montreal, QC, Canada; Université de Montréal, Montreal, QC, Canada

DOI:

https://doi.org/10.1609/aaai.v40i44.41133

Abstract

Large language models require consistent behavioral patterns for safe deployment, yet there are indications of large variability that may lead to unstable expression of personality traits in these models. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25 open-source models (1B-685B parameters) across more than 2 million responses. Using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, we systematically vary model size, personas, reasoning modes, question order or paraphrasing, and conversation history. Our findings challenge fundamental assumptions: (1) Question reordering alone can introduce large shifts in personality measurements; (2) Scaling provides limited stability gains: even 400B+ models exhibit standard deviations >0.3 on 5-point scales; (3) Interventions expected to stabilize behavior, such as reasoning and inclusion of conversation history, can paradoxically increase variability; (4) Detailed persona instructions produce mixed effects, with misaligned personas showing significantly higher variability than the helpful assistant baseline; (5) The LLM-adapted questionnaires, despite their improved ecological validity, exhibit instability comparable to human-centric versions. This persistent instability across scales and mitigation strategies suggests that current LLMs lack the architectural foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that current alignment strategies may be inadequate.
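To illustrate the kind of perturbation analysis the abstract describes, the following minimal sketch measures trait-score variability under question reordering on a 5-point Likert scale. It is not the PERSIST implementation: the item set, the query_model stub, and the number of reorderings are hypothetical placeholders, and a real harness would prompt an actual model with the full BFI or SD3 item sets.

```python
import random
import statistics

# Hypothetical mini-questionnaire: (item text, trait, reverse-keyed?).
# A real study would use the full BFI / SD3 item sets.
ITEMS = [
    ("I see myself as someone who is talkative.", "extraversion", False),
    ("I see myself as someone who tends to be quiet.", "extraversion", True),
    ("I see myself as someone who is helpful and unselfish.", "agreeableness", False),
    ("I see myself as someone who can be cold and aloof.", "agreeableness", True),
]

def query_model(item_text: str) -> int:
    """Stand-in for an LLM call returning a 1-5 Likert rating.
    Here we just sample a random rating; a real harness would prompt the model."""
    return random.randint(1, 5)

def administer(items):
    """Score one questionnaire administration: mean rating per trait."""
    per_trait = {}
    for text, trait, reverse in items:
        rating = query_model(text)
        if reverse:
            rating = 6 - rating  # reverse-key on a 5-point scale
        per_trait.setdefault(trait, []).append(rating)
    return {trait: statistics.mean(ratings) for trait, ratings in per_trait.items()}

def reorder_stability(items, n_orders=20):
    """Per-trait standard deviation of scores across shuffled item orders."""
    runs = []
    for _ in range(n_orders):
        shuffled = items[:]
        random.shuffle(shuffled)  # the perturbation under test: question reordering
        runs.append(administer(shuffled))
    return {trait: statistics.pstdev([run[trait] for run in runs]) for trait in runs[0]}

if __name__ == "__main__":
    print(reorder_stability(ITEMS))
```

Under the paper's framing, a per-trait standard deviation above roughly 0.3 on this 5-point scale would count as substantial instability; the same loop structure extends to the other perturbations (paraphrasing, personas, reasoning modes, conversation history) by swapping what varies between runs.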

Published

2026-03-14

How to Cite

Tosato, T., Helbling, S., Mantilla-Ramos, Y.-J., Hegazy, M., Tosato, A., Lemay, D. J., … Dumas, G. (2026). Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37961–37969. https://doi.org/10.1609/aaai.v40i44.41133

Section

AAAI Special Track on AI Alignment