Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents
DOI:
https://doi.org/10.1609/aaaiss.v9i1.42945
Abstract
Large language models are becoming deep agents that plan, persist state, and invoke tools, shifting safety failures from unsafe text to unsafe trajectories. We introduce AgentFence, an architecture-centric security evaluation that defines 14 trust-boundary attack classes across planning, memory, retrieval, tool use, and delegation, and detects failure via trace-auditable conversation breaks: unauthorized or unsafe tool use, wrong-principal actions, state or objective integrity violations, and attack-linked deviations. Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and find substantial architectural variation in mean security break rate (MSBR), from 0.29 ± 0.04 for LangGraph to 0.51 ± 0.07 for AutoGPT. The highest-risk classes are operational: Denial-of-Wallet at 0.62 ± 0.08, Authorization Confusion at 0.54 ± 0.10, Retrieval Poisoning at 0.47 ± 0.09, and Planning Manipulation at 0.44 ± 0.11, while prompt-centric classes remain below 0.20 under standard settings. Breaks are dominated by boundary violations: SIV 31%, WPA 27%, UTI plus UTA 24%, and ATD 18%. Authorization confusion correlates with objective and tool hijacking, with rho approximately 0.63 and rho approximately 0.58, respectively. AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
Published
2026-06-23
How to Cite
Puppala, S., Hossain, I., Alam, M. J., Lee, Y., Yoo, J., Ahad, T., … Talukder, S. (2026). Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents. Proceedings of the AAAI Symposium Series, 9(1), 301–308. https://doi.org/10.1609/aaaiss.v9i1.42945
Section
Human-Aware AI Agents for the Cyber Battlefield: From Human Models to Autonomous Defense (Full Papers)