ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

Authors

  • Zhewei Yao, UC Berkeley
  • Amir Gholami, UC Berkeley
  • Sheng Shen, UC Berkeley
  • Mustafa Mustafa, Lawrence Berkeley National Laboratory
  • Kurt Keutzer, UC Berkeley
  • Michael Mahoney, UC Berkeley

DOI:

https://doi.org/10.1609/aaai.v35i12.17275

Keywords:

Optimization, Learning & Optimization for SNLP

Abstract

Incorporating second-order curvature information into machine learning optimization algorithms can be subtle, and doing so naïvely can lead to high per-iteration costs associated with forming the Hessian and performing the associated linear system solve. To address this, we introduce ADAHESSIAN, a new stochastic optimization algorithm. ADAHESSIAN directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including: (i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead; (ii) spatial averaging to reduce the variance of the second derivative; and (iii) a root-mean-square exponential moving average to smooth out variations of the second derivative across different iterations. We perform extensive tests on NLP, CV, and recommendation system tasks, and ADAHESSIAN achieves state-of-the-art results. In particular, we find that ADAHESSIAN: (i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (ii) outperforms AdamW for SqueezeBERT by 0.41 points on GLUE; (iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on Cifar10/ImageNet as compared to Adam; and (iv) achieves a 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. The cost per iteration of ADAHESSIAN is comparable to first-order methods, and ADAHESSIAN exhibits improved robustness towards variations in hyperparameter values. The code for ADAHESSIAN is open-sourced and publicly available [1].
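The Hutchinson method mentioned in (i) estimates the diagonal of the Hessian from Hessian-vector products alone, using the identity E[z ⊙ (Hz)] = diag(H) for random Rademacher vectors z. The sketch below illustrates this estimator on a small fixed matrix; the function name `hutchinson_diag` and the toy matrix are illustrative assumptions, not the paper's implementation (which applies Hessian-vector products to the training loss via backpropagation):

```python
import numpy as np

def hutchinson_diag(hvp, dim, num_samples=1000, rng=None):
    """Estimate diag(H) given only a Hessian-vector product oracle.

    hvp: callable computing H @ z for a vector z.
    Draws Rademacher vectors z (entries +/-1) and averages z * (H @ z),
    whose expectation is the Hessian diagonal.
    """
    rng = np.random.default_rng(rng)
    est = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        est += z * hvp(z)  # elementwise product; unbiased diagonal sample
    return est / num_samples

# Toy symmetric "Hessian" standing in for the true loss curvature.
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
diag_est = hutchinson_diag(lambda z: H @ z, dim=3, num_samples=5000, rng=0)
```

Because each sample needs only one Hessian-vector product (roughly the cost of one extra backward pass in practice), the per-iteration overhead stays comparable to first-order methods, as the abstract notes.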

Published

2021-05-18

How to Cite

Yao, Z., Gholami, A., Shen, S., Mustafa, M., Keutzer, K., & Mahoney, M. (2021). ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10665-10673. https://doi.org/10.1609/aaai.v35i12.17275

Section

AAAI Technical Track on Machine Learning V