Unsupervised Detection of Long-Term Idle Periods in Large-Scale On-Premises Server Fleets
DOI:
https://doi.org/10.1609/aaaiss.v9i1.42906
Abstract
As on-premises GPU and server fleets scale to meet AI workload demands, substantial hardware assets remain underutilized, resulting in prolonged, high-cost “idle” periods. Detecting these segments in large-scale environments is inherently difficult due to the absence of ground-truth labels and the high volatility of modern workloads. We propose an unsupervised pipeline for identifying long-term idle intervals in unlabeled multivariate utilization time series. By leveraging daily volatility vectors across CPU, memory, GPU, and storage metrics (/data space and /root space), our novel framework, the BGMM-HMM, employs a Bayesian Gaussian Mixture Model for state clustering followed by a Hidden Markov Model to enforce temporal consistency. Experiments on production server-fleet data show that the BGMM-HMM identifies underutilized assets ≈5× more effectively than traditional rule-based baselines. Critically, ablation studies demonstrate that the HMM integration reduces spurious state-switching by >90% compared to standalone clustering, providing the stable, contiguous intervals necessary for practical resource reclamation. Furthermore, robustness tests via synthetic noise injection confirm a 98.3% sensitivity to workload spikes. This framework provides a scalable and operationally stable tool for infrastructure optimization and ESG-aligned sustainable computing.
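To illustrate why a temporal model can suppress spurious state-switching relative to standalone clustering, the following is a minimal sketch of Viterbi decoding under a "sticky" two-state HMM. It is not the authors' implementation: the per-day state probabilities in `probs` are made-up stand-ins for clustering output, and the function names and the `stay_prob` self-transition value are hypothetical choices for illustration only.

```python
import math


def viterbi_smooth(log_emis, stay_prob=0.95):
    """Return the most likely state path under a sticky HMM.

    log_emis[t][k] is the log-probability that day t belongs to state k
    (for example, taken from a clustering model). stay_prob is an assumed
    self-transition probability; the remaining mass is split evenly across
    the other states. The initial distribution is uniform, so it is omitted.
    Requires at least two states.
    """
    n_states = len(log_emis[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def trans(j, k):
        return log_stay if j == k else log_switch

    score = list(log_emis[0])
    backpointers = []
    for obs in log_emis[1:]:
        new_score, ptr = [], []
        for k in range(n_states):
            # Best predecessor state for landing in state k today.
            best_j = max(range(n_states), key=lambda j: score[j] + trans(j, k))
            ptr.append(best_j)
            new_score.append(score[best_j] + trans(best_j, k) + obs[k])
        score = new_score
        backpointers.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def count_switches(path):
    """Number of day-to-day state changes in a label sequence."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Hypothetical per-day probabilities for (idle=0, active=1):
# a long idle stretch with one ambiguous day, then sustained activity.
probs = [[0.9, 0.1]] * 5 + [[0.4, 0.6]] + [[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 6

# Standalone clustering: pick the most probable state each day independently.
raw = [max(range(len(p)), key=lambda k: p[k]) for p in probs]

# HMM smoothing: penalize switches, so the isolated ambiguous day is absorbed.
smoothed = viterbi_smooth([[math.log(x) for x in p] for p in probs])

print("raw switches:", count_switches(raw))
print("smoothed switches:", count_switches(smoothed))
```

In this toy sequence the isolated ambiguous day is absorbed into the surrounding idle interval, while the sustained change to the active state is preserved. A higher `stay_prob` yields longer contiguous intervals at the cost of slower reaction to genuine state changes.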
Published
2026-06-23
How to Cite
Javed, A., Yang, H., & Shahid, Z. (2026). Unsupervised Detection of Long-Term Idle Periods in Large-Scale On-Premises Server Fleets. Proceedings of the AAAI Symposium Series, 9(1), 61–68. https://doi.org/10.1609/aaaiss.v9i1.42906
Section
AI-Driven Resilience: Building Robust, Adaptive Technologies for a Dynamic World (Full Papers)