Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Authors

  • Qihao Wang, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
  • Yue Hu, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
  • Mingzhe Lu, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
  • Jiayue Wu, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
  • Yanbing Liu, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
  • Yuanmin Tang, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i39.40650

Abstract

The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model’s capability boundary. We validate that our framework’s predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent’s limits and a practical foundation for building more efficient systems.

Published

2026-03-14

How to Cite

Wang, Q., Hu, Y., Lu, M., Wu, J., Liu, Y., & Tang, Y. (2026). Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33611–33619. https://doi.org/10.1609/aaai.v40i39.40650

Section

AAAI Technical Track on Natural Language Processing IV