AssetOpsBench-Live: Privacy-Aware Online Evaluation of Multi-Agent Performance in Industrial Operations

Authors

  • Dhaval Patel (IBM Research, Yorktown)
  • Nianjun Zhou (IBM Research, Yorktown)
  • Shuxin Lin (IBM Research, Yorktown)
  • James Rayfield (IBM Research, Yorktown)
  • Chathurangi Shyalika (Artificial Intelligence Institute, University of South Carolina)
  • Suryanarayana Reddy Yarrabothula (Steel Authority of India Limited)

DOI:

https://doi.org/10.1609/aaai.v40i48.42372

Abstract

Industrial automation increasingly relies on multi-agent AI, yet evaluation remains difficult due to task complexity and data confidentiality. We present AssetOpsBench-Live, a demo of a competition-ready platform for real-time, privacy-preserving evaluation of multi-agent AI in industrial contexts. The platform integrates AssetOpsBench, which measures six dimensions of multi-agent performance and performs automated failure-mode discovery, with Codabench, which supports reproducible, code-oriented competitions. End users first validate agents locally, then submit containerized code for execution on hidden industrial scenarios. Instead of raw trajectories, the system provides quantitative scores and clustered failure modes (e.g., reasoning-action mismatch, step repetition), enabling participants to identify failures, apply targeted improvements, and iteratively resubmit. By combining competition-based engagement with actionable diagnostics, AssetOpsBench-Live delivers reproducible, real-time insights reflecting real-world industrial constraints.
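
The feedback step described in the abstract, returning scores and clustered failure modes rather than raw trajectories, might look roughly like the sketch below. This is an illustration only, not the platform's actual code: the `TrajectoryResult` record, the `build_feedback` function, and the report keys are hypothetical names, and the abstract does not specify how scores are aggregated, so a simple per-dimension mean is assumed.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrajectoryResult:
    """Hypothetical per-scenario record held on the evaluation server."""
    scenario_id: str
    scores: dict[str, float]   # dimension name -> score (names are assumptions)
    failure_modes: list[str]   # labels produced by failure-mode clustering
    raw_trajectory: str        # confidential; never copied into the report


def build_feedback(results: list[TrajectoryResult]) -> dict:
    """Aggregate hidden-scenario results into a participant-facing report.

    Only per-dimension mean scores and failure-mode counts are returned;
    scenario IDs and raw trajectories stay on the server.
    """
    dimensions = sorted({d for r in results for d in r.scores})
    mean_scores = {
        d: round(mean(r.scores[d] for r in results if d in r.scores), 3)
        for d in dimensions
    }
    mode_counts = Counter(m for r in results for m in r.failure_modes)
    return {
        "num_scenarios": len(results),
        "mean_scores": mean_scores,
        "failure_modes": dict(mode_counts.most_common()),
    }
```

Under these assumptions, a participant would see only aggregate numbers and failure-mode labels such as "step repetition", which is enough to target an improvement before resubmitting without exposing the hidden scenarios themselves.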

Published

2026-03-14

How to Cite

Patel, D., Zhou, N., Lin, S., Rayfield, J., Shyalika, C., & Yarrabothula, S. R. (2026). AssetOpsBench-Live: Privacy-Aware Online Evaluation of Multi-Agent Performance in Industrial Operations. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41658–41660. https://doi.org/10.1609/aaai.v40i48.42372