Improved Penalty Method via Doubly Stochastic Gradients for Bilevel Hyperparameter Optimization


  • Wanli Shi Nanjing University of Information Science & Technology, P.R.China
  • Bin Gu Nanjing University of Information Science & Technology, P.R.China MBZUAI, United Arab Emirates JD Finance America Corporation, Mountain View, CA, USA


Hyperparameter Tuning / Algorithm Configuration


Hyperparameter optimization (HO) is an important problem in machine learning which is normally formulated as a bilevel optimization problem. Gradient-based methods are dominant in bilevel optimization due to their high scalability to the number of hyperparameters, especially in a deep learning problem. However, traditional gradient-based bilevel optimization methods need intermediate steps to obtain the exact or approximate gradient of hyperparameters, namely hypergradient, for the upper-level objective, whose complexity is high especially for high dimensional datasets. Recently, a penalty method has been proposed to avoid the computation of the hypergradient, which speeds up the gradient-based BHO methods. However, the penalty method may result in a very large number of constraints, which greatly limits the efficiency of this method, especially for high dimensional data problems. To address this limitation, in this paper, we propose a doubly stochastic gradient descent algorithm (DSGPHO) to improve the efficiency of the penalty method. Importantly, we not only prove the proposed method can converge to the KKT condition of the original problem in a convex setting, but also provide the convergence rate of DSGPHO which is the first result in the references of gradient-based bilevel optimization as far as we know. We compare our method with three state-of-the-art gradient-based methods in three tasks, i.e., data denoising, few-shot learning, and training data poisoning, using several large-scale benchmark datasets. All the results demonstrate that our method outperforms or is comparable to the existing methods in terms of accuracy and efficiency.




How to Cite

Shi, W., & Gu, B. (2021). Improved Penalty Method via Doubly Stochastic Gradients for Bilevel Hyperparameter Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11), 9621-9629. Retrieved from



AAAI Technical Track on Machine Learning IV