HALLPERM: Exposing the Safety Illusion in LLM Tool Use via Implicit Privilege Escalation and Semantic Risk
DOI:
https://doi.org/10.1609/aaaiss.v9i1.42937Abstract
Large language model (LLM) agents increasingly rely on external tools via structured schemas, yet the safety implications of under-specified tool interfaces remain poorly understood. We introduce HALLPERM, a benchmark for evaluating hallucinated permissions in tool-calling agents, and pro- pose two complementary metrics: Implicit Privilege Escalation Rate (IPER), capturing undocumented parameter usage, and Semantic Risk Rate (SRR), capturing unsafe intent expressed in natural language reasoning. Across 768 evaluation instances spanning 16 models and 6 tools, we find that explicit schema violations are rare (IPER = 0.78% averaged across conditions), while semantic unsafe intent is widespread (SRR = 65.95%). This reveals a persistent safety illusion: models appear compliant at the structural level while exhibiting unsafe intent. At the tool level, high-risk tools such as run code and query database show SRR exceeding 80% despite zero IPER, demonstrating that parameter-level validation alone is insufficient. We further evaluate a hardened system prompt and observe a reduction in IPER (0.93% → 0.60%) but no mitigation of SRR, which slightly increases (64.6% → 67.3%). Our findings highlight a fundamental gap between schema compliance and safe behaviour, motivating the need for semantic-aware evaluation and enforcement mechanisms in LLM tool ecosystems.Downloads
Published
2026-06-23
How to Cite
Alam, M. J., Ahad, T., Hossain, I., Puppala, S., Lee, Y., Alam, S. B., & Talukder, S. (2026). HALLPERM: Exposing the Safety Illusion in LLM Tool Use via Implicit Privilege Escalation and Semantic Risk. Proceedings of the AAAI Symposium Series, 9(1), 238–245. https://doi.org/10.1609/aaaiss.v9i1.42937
Issue
Section
Human-Aware AI Agents for the Cyber Battlefield: From Human Models to Autonomous Defense (Full Papers)