On the Adequacy of Untuned Warmup for Adaptive Optimization

Authors

  • Jerry Ma (Booth School of Business, University of Chicago; U.S. Patent and Trademark Office, Department of Commerce)
  • Denis Yarats (Courant Institute of Mathematical Sciences, New York University; Facebook AI Research)

Keywords:

Optimization, (Deep) Neural Network Algorithms

Abstract

Adaptive optimization algorithms such as Adam (Kingma and Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more or less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over 2 / (1 - β₂) training iterations.
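The suggested default translates directly into a learning-rate multiplier that ramps linearly from near zero to one. A minimal sketch of that rule of thumb (the function name, the step-offset convention, and the β₂ default here are illustrative choices, not prescribed by the paper):

```python
def untuned_linear_warmup(step, beta2=0.999):
    """Learning-rate multiplier for untuned linear warmup of Adam.

    Ramps linearly toward 1 over roughly 2 / (1 - beta2) iterations,
    then stays at 1. Multiply the base learning rate by this value.
    """
    # Rule-of-thumb warmup horizon from the abstract: 2 / (1 - beta2) steps.
    warmup_steps = 2.0 / (1.0 - beta2)
    # Linear ramp (using 1-indexed steps so the multiplier is never zero),
    # capped at 1 once the warmup horizon is reached.
    return min(1.0, (step + 1) / warmup_steps)
```

With the common default β₂ = 0.999, this yields a warmup horizon of roughly 2000 iterations; the schedule can be plugged into any framework's lambda-style scheduler as a per-step multiplier on the base learning rate.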

Published

2021-05-18

How to Cite

Ma, J., & Yarats, D. (2021). On the Adequacy of Untuned Warmup for Adaptive Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10), 8828-8836. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17069

Section

AAAI Technical Track on Machine Learning III