Unsupervised Detection of Long-Term Idle Periods in Large-Scale On-Premises Server Fleets

Ahmed Javed; Haneul Yang; Zohaib Shahid

doi:10.1609/aaaiss.v9i1.42906

Authors

Ahmed Javed, OnitsAI Inc., South Korea
Haneul Yang, OnitsAI Inc., South Korea
Zohaib Shahid, Loughborough University London, United Kingdom

DOI:

https://doi.org/10.1609/aaaiss.v9i1.42906

Abstract

As on-premises GPU and server fleets scale to meet AI workload demands, substantial hardware assets remain underutilized, resulting in prolonged, high-cost “idle” periods. Detecting these segments in large-scale environments is inherently difficult due to the absence of ground-truth labels and the high volatility of modern workloads. We propose an unsupervised pipeline for identifying long-term idle intervals in unlabeled multivariate utilization time series. By leveraging daily volatility vectors across CPU, memory, GPU, and storage metrics (/data space and /root space), our novel framework, the BGMM-HMM, employs a Bayesian Gaussian Mixture Model for state clustering followed by a Hidden Markov Model to enforce temporal consistency. Experiments on production server-fleet data show that the BGMM-HMM identifies underutilized assets ≈5× more effectively than traditional rule-based baselines. Critically, ablation studies demonstrate that the HMM integration reduces spurious state-switching by >90% compared to standalone clustering, providing the stable, contiguous intervals necessary for practical resource reclamation. Furthermore, robustness tests via synthetic noise injection confirm a 98.3% sensitivity to workload spikes. This framework provides a scalable and operationally stable tool for infrastructure optimization and ESG-aligned sustainable computing.
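The two ideas named in the abstract, daily volatility features and temporal smoothing of cluster assignments, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. It assumes that volatility is the within-day standard deviation of a metric, that a mixture model has already produced per-day state probabilities, and that those probabilities can stand in for HMM emission likelihoods. The function names, the two-state setup, and the `stay_prob` value are hypothetical choices made for illustration.

```python
import math
from statistics import pstdev


def daily_volatility(samples_per_day):
    """Within-day standard deviation of one metric (assumed volatility measure)."""
    return [pstdev(day) for day in samples_per_day]


def viterbi_smooth(posteriors, stay_prob=0.95):
    """Most likely state path under a 'sticky' transition matrix.

    posteriors: one list of state probabilities per day, used here as
    stand-ins for emission likelihoods (a simplifying assumption).
    stay_prob: assumed probability of remaining in the same state.
    """
    n_states = len(posteriors[0])
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n_states)]
        for i in range(n_states)
    ]
    eps = 1e-12  # avoids log(0)
    # Uniform prior over the initial state.
    score = [math.log(p + eps) - math.log(n_states) for p in posteriors[0]]
    back = []
    for obs in posteriors[1:]:
        prev, pointers, score = score, [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j] + math.log(obs[j] + eps))
        back.append(pointers)
    # Trace the best path backwards from the final day.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def count_switches(path):
    """Number of day-to-day state changes in a label sequence."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Hypothetical per-day probabilities for [idle, active]; day 3 is a noisy blip.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
              [0.85, 0.15], [0.1, 0.9], [0.05, 0.95], [0.1, 0.9]]
raw = [max(range(2), key=lambda s: p[s]) for p in posteriors]
smoothed = viterbi_smooth(posteriors)
print(raw, count_switches(raw))            # [0, 0, 1, 0, 0, 1, 1, 1] 3
print(smoothed, count_switches(smoothed))  # [0, 0, 0, 0, 0, 1, 1, 1] 1
```

In this toy input the per-day argmax flips state three times, while the smoothed path keeps a single contiguous idle interval followed by an active one, which is the kind of reduction in spurious state-switching the abstract attributes to the HMM stage.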
