Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Authors

  • Yilin Geng University of Melbourne
  • Haonan Li Mohamed bin Zayed University of Artificial Intelligence
  • Honglin Mu Mohamed bin Zayed University of Artificial Intelligence
  • Xudong Han Mohamed bin Zayed University of Artificial Intelligence
  • Timothy Baldwin Mohamed bin Zayed University of Artificial Intelligence University of Melbourne
  • Omri Abend Hebrew University of Jerusalem
  • Eduard Hovy University of Melbourne
  • Lea Frermann University of Melbourne

DOI:

https://doi.org/10.1609/aaai.v40i36.40339

Abstract

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.

Downloads

Published

2026-03-14

How to Cite

Geng, Y., Li, H., Mu, H., Han, X., Baldwin, T., Abend, O., Hovy, E., & Frermann, L. (2026). Control Illusion: The Failure of Instruction Hierarchies in Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30816-30824. https://doi.org/10.1609/aaai.v40i36.40339

Issue

Section

AAAI Technical Track on Natural Language Processing I