Scaling and Transferability of Annealing Strategies in Large Language Model Training

Authors

  • Siqi Wang, The Hong Kong University of Science and Technology
  • Zhengyu Chen, Meituan Inc.
  • Teng Xiao, Allen Institute for Artificial Intelligence, USA
  • Zheqi Lv, Meituan Inc.
  • Jinluan Yang, Meituan Inc.
  • Xunliang Cai, Meituan Inc.
  • Jingang Wang, Meituan Inc.
  • Xiaomeng Li, The Hong Kong University of Science and Technology; Shenzhen Loop Area Institute

DOI:

https://doi.org/10.1609/aaai.v40i40.40653

Abstract

Learning rate scheduling is crucial for training large language models, yet the optimal annealing strategy across different model configurations remains poorly understood. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. The improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments on both Dense and Mixture-of-Experts (MoE) models, showing that optimal annealing ratios follow consistent patterns and transfer across different training configurations.
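For concreteness, the sketch below illustrates the kind of Warmup-Steady-Decay schedule the abstract refers to: a linear warmup to the maximum learning rate, a constant plateau, and an annealing (decay) phase over the final fraction of training controlled by an annealing ratio. The function name `wsd_lr`, the linear shapes of the warmup and decay, and the parameter names are illustrative assumptions, not the paper's specification; the paper's contribution is a framework for predicting a good `annealing_ratio` from training steps and maximum learning rate, rather than this particular schedule.

```python
def wsd_lr(step: int, total_steps: int, max_lr: float,
           warmup_steps: int, annealing_ratio: float,
           min_lr: float = 0.0) -> float:
    """Learning rate at `step` under a generic WSD schedule (a sketch):
    linear warmup to max_lr, a constant plateau at max_lr, then linear
    annealing over the final `annealing_ratio` fraction of training."""
    decay_steps = int(total_steps * annealing_ratio)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    if step < decay_start:
        # Steady phase: hold at the maximum learning rate.
        return max_lr
    # Decay (annealing) phase: interpolate linearly from max_lr to min_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return max_lr + (min_lr - max_lr) * min(1.0, progress)
```

For example, with `total_steps=10_000` and `annealing_ratio=0.1`, the learning rate is held at `max_lr` after warmup and annealed only over the last 1,000 steps; the paper studies how the best choice of this ratio behaves across model sizes and configurations.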

Published

2026-03-14

How to Cite

Wang, S., Chen, Z., Xiao, T., Lv, Z., Yang, J., Cai, X., … Li, X. (2026). Scaling and Transferability of Annealing Strategies in Large Language Model Training. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33639–33647. https://doi.org/10.1609/aaai.v40i40.40653

Issue

Vol. 40 No. 40 (2026)

Section

AAAI Technical Track on Natural Language Processing V