From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production
DOI:
https://doi.org/10.1609/aaai.v40i47.41485
Abstract
Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across tasks, applications, and modalities. Yet evidence of their use in enterprise settings remains limited. This paper reports IBM’s experience developing and piloting the Computer Using Generalist Agent (CUGA). CUGA adopts a hierarchical planner-executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was evaluated in a Business-Process-Outsourcing talent acquisition pilot, addressing enterprise requirements for scalability, auditability, safety, and governance. In preliminary evaluations, CUGA approached the accuracy of specialized agents while suggesting reductions in development time and cost. We provide early evidence that generalist agents can operate at enterprise scale, distill key technical and organizational lessons, and outline requirements for transitioning research-grade architectures like CUGA into enterprise-ready systems.
Published
2026-03-14
How to Cite
Shlomov, S., Oved, A., Marreed, S., Levy, I., Akrabi, O., Yaeli, A., … Adi, A. (2026). From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production. Proceedings of the AAAI Conference on Artificial Intelligence, 40(47), 40423–40431. https://doi.org/10.1609/aaai.v40i47.41485
Issue
Section
IAAI Technical Track on Emerging Applications of AI