FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

Junkang Liu; Fanhua Shang; Hongying Liu; Yuxuan Tian; Yuanyuan Liu; Jin Liu; Kewen Zhu; Zhouchen Lin

doi:10.1609/aaai.v40i28.39549

Authors

Junkang Liu Tianjin University
Fanhua Shang Tianjin University
Hongying Liu Tianjin University
Yuxuan Tian Institute of Automation, Chinese Academy of Sciences
Yuanyuan Liu Xidian University
Jin Liu Xi'an University of Electronic Science and Technology
Kewen Zhu Tianjin University
Zhouchen Lin Peking University, Pazhou Laboratory (Huangpu)

DOI:

https://doi.org/10.1609/aaai.v40i28.39549

Abstract

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate v; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing the moment estimates (v, m) at each round slows down convergence. To address these challenges, we propose the first Federated AdamW algorithm, called FedAdamW, for training and fine-tuning various large models. FedAdamW aligns local updates with the global update using both a local correction mechanism and decoupled weight decay to mitigate local overfitting. FedAdamW efficiently aggregates the mean of the second-moment estimates to reduce their variance and to reinitialize them. Theoretically, we prove that FedAdamW achieves a linear speedup convergence rate of $\mathcal{O}\left(\sqrt{(L\Delta\sigma_l^2)/(SKR\epsilon^2)} + (L\Delta)/R\right)$ without any heterogeneity assumption, where S is the number of participating clients per round, K is the number of local iterations, and R is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of FedAdamW on language and vision Transformer models. Compared to several baselines, FedAdamW significantly reduces communication rounds and improves test accuracy.
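
The three ingredients named in the abstract (a local correction toward the global update, decoupled weight decay, and server-side averaging of the second-moment estimates) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the correction weight `alpha`, the form of the correction term, and the omission of bias correction are all assumptions made for illustration only.

```python
import math


def local_adamw_step(x, m, v, grad, global_dir, lr=0.01, beta1=0.9,
                     beta2=0.999, eps=1e-8, weight_decay=0.01, alpha=0.5):
    """One local AdamW-style step on lists of floats.

    `global_dir` and `alpha` are hypothetical: they stand in for a
    correction that pulls the local step toward an estimate of the
    global update. Bias correction is omitted for brevity.
    """
    new_x, new_m, new_v = [], [], []
    for xi, mi, vi, gi, di in zip(x, m, v, grad, global_dir):
        mi = beta1 * mi + (1 - beta1) * gi          # first moment
        vi = beta2 * vi + (1 - beta2) * gi * gi     # second moment
        adaptive = mi / (math.sqrt(vi) + eps)
        # Decoupled weight decay: shrink the parameter directly instead
        # of adding weight_decay * xi to the gradient.
        xi = xi - lr * (adaptive + alpha * di) - lr * weight_decay * xi
        new_x.append(xi)
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def mean_second_moment(client_vs):
    """Server side: element-wise mean of the clients' second moments."""
    n = len(client_vs)
    return [sum(vals) / n for vals in zip(*client_vs)]


# Toy round with two clients sharing the same starting point.
x0 = [1.0, -2.0]
v_shared = [0.0, 0.0]          # would be the aggregated mean from the last round
global_dir = [0.0, 0.0]        # assumed estimate of the global update direction
client_grads = [[0.5, -1.0], [1.5, 1.0]]

client_vs = []
for g in client_grads:
    _, _, v_i = local_adamw_step(x0, [0.0, 0.0], v_shared, g, global_dir)
    client_vs.append(v_i)

v_next_round = mean_second_moment(client_vs)  # reused to initialize v next round
```

In this sketch every client starts the next round from the same averaged `v`, which is one plausible reading of how averaging reduces the variance of the second-moment estimates across heterogeneous clients.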
