ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization

Authors

  • Shuoran Jiang, Harbin Institute of Technology (Shenzhen)
  • Qingcai Chen, Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory
  • Youcheng Pan, Peng Cheng Laboratory
  • Yang Xiang, Peng Cheng Laboratory
  • Yukang Lin, Harbin Institute of Technology (Shenzhen)
  • Xiangping Wu, Harbin Institute of Technology (Shenzhen)
  • Chuanyi Liu, Institute of Data Security, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Peng Cheng Laboratory
  • Xiaobao Song, Institute of Data Security, Harbin Institute of Technology (Shenzhen), Shenzhen, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29796

Keywords:

NLP: Learning & Optimization for NLP, ML: Optimization, NLP: (Large) Language Models, NLP: Applications

Abstract

Lowering the memory requirement of full-parameter training for large models has become a hot research area. MeZO fine-tunes large language models (LLMs) using only forward passes in a zeroth-order SGD optimizer (ZO-SGD), achieving excellent performance with the same GPU memory footprint as inference. However, the simulated perturbation stochastic approximation used for gradient estimation in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO suffers from severe over-fitting. Lastly, momentum applied independently of the perturbation does not improve the convergence rate of ZO-SGD. This study proposes ZO-AdaMU to resolve these problems by adapting the simulated perturbation with momentum in the stochastic approximation. Unlike existing adaptive momentum methods, we relocate the momentum onto the simulated perturbation in the stochastic gradient approximation. Our convergence analysis and experiments show that this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization than MeZO and its momentum variants when fine-tuning LLMs across various NLP tasks.
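
To make the idea concrete, below is a minimal PyTorch sketch of one forward-only (zeroth-order) optimization step in the spirit of MeZO, where the simulated perturbation is blended with a running momentum buffer as ZO-AdaMU proposes. The function name, the mixing coefficient `alpha`, and the hyperparameter values are illustrative assumptions, not the authors' exact formulation (which also incorporates an uncertainty term).

```python
import torch

@torch.no_grad()
def zo_step_with_momentum_perturbation(params, loss_fn, momentum,
                                       lr=1e-6, eps=1e-3, alpha=0.9):
    """Illustrative zeroth-order step: the perturbation is an exponential
    moving average of Gaussian noise instead of fresh noise alone.
    alpha, eps, and lr are assumed values for the sketch."""
    # Blend fresh Gaussian noise into the perturbation-momentum buffer.
    for m, p in zip(momentum, params):
        m.mul_(alpha).add_((1 - alpha) * torch.randn_like(p))

    # Two forward passes at theta + eps*u and theta - eps*u (no backprop).
    for p, u in zip(params, momentum):
        p.add_(eps * u)
    loss_plus = loss_fn()
    for p, u in zip(params, momentum):
        p.sub_(2 * eps * u)
    loss_minus = loss_fn()
    for p, u in zip(params, momentum):
        p.add_(eps * u)  # restore the original parameters

    # Projected-gradient estimate; SGD-style update along the perturbation.
    g = (loss_plus - loss_minus) / (2 * eps)
    for p, u in zip(params, momentum):
        p.sub_(lr * g * u)
    return loss_plus, loss_minus
```

In use, `momentum` would be initialized as `[torch.zeros_like(p) for p in model.parameters()]` and `loss_fn` would run a forward pass on the current mini-batch; memory stays at inference level because no gradients or activations are stored.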

Published

2024-03-24

How to Cite

Jiang, S., Chen, Q., Pan, Y., Xiang, Y., Lin, Y., Wu, X., Liu, C., & Song, X. (2024). ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18363-18371. https://doi.org/10.1609/aaai.v38i16.29796

Section

AAAI Technical Track on Natural Language Processing I