Predicting Emergent Tool Use in LLMs Before It Emerges: A Proxy Perspective

Authors

  • Bo-Wen Zhang University of Science and Technology Beijing
  • Yan Yan China University of Mining and Technology Beijing
  • Guang Liu Beijing Academy of Artificial Intelligence
  • Xu-Cheng Yin University of Science and Technology Beijing

DOI:

https://doi.org/10.1609/aaai.v40i41.40763

Abstract

Tool-use capabilities fundamentally transform large language models (LLMs) from passive language generators into active agents with real-world utility, drawing intense research focus. Yet, their emergent nature renders traditional scaling laws ineffective for early-stage prediction, obstructing principled model design and efficient training. In this work, we propose a proxy-task perspective that predicts tool-use capabilities by measuring early model performance on selected non-emergent proxy tasks. Our method quantifies two properties of each proxy task: alignment, which reflects how well it captures tool-use trajectories, and stability, which indicates how consistently it behaves across training conditions. These properties are used to weight predictive signals. Theoretically, we formalize how these weighted signals approximate emergent tool use through bounded extrapolation under relaxed assumptions. Empirically, we validate our approach across training checkpoints, model scales, and data setups. Results show that a carefully weighted ensemble of proxy tasks can accurately rank downstream tool-use ability long before it arises. Our findings provide new theoretical foundations and practical tools for efficient training and capability planning, and advance the understanding of how complex abilities arise in LLMs.
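The abstract's core idea, weighting each proxy task's early signal by its alignment with tool-use trajectories and its stability across training conditions, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name and the simple product-of-weights scheme are assumptions for clarity.

```python
def predict_tool_use_score(proxy_scores, alignments, stabilities):
    """Combine early proxy-task scores into one predictive signal.

    proxy_scores: early-checkpoint performance on each proxy task (0..1)
    alignments:   how well each proxy tracks tool-use trajectories (0..1)
    stabilities:  how consistent each proxy is across training setups (0..1)

    Each proxy is weighted by alignment * stability (a hypothetical choice),
    so tasks that are both predictive and reliable dominate the ensemble.
    """
    weights = [a * s for a, s in zip(alignments, stabilities)]
    total = sum(weights)
    if total == 0:
        return 0.0  # no informative proxies available
    return sum(w * p for w, p in zip(weights, proxy_scores)) / total
```

Such a weighted score can then be used to rank candidate models or checkpoints by predicted downstream tool-use ability, as the abstract describes, long before the emergent capability itself is measurable.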

Published

2026-03-14

How to Cite

Zhang, B.-W., Yan, Y., Liu, G., & Yin, X.-C. (2026). Predicting Emergent Tool Use in LLMs Before It Emerges: A Proxy Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34629–34637. https://doi.org/10.1609/aaai.v40i41.40763

Section

AAAI Technical Track on Natural Language Processing VI