(1) Yu, H.; Yang, S.; Zhu, S. Parallel Restarted SGD With Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. AAAI 2019, 33, 5693-5700.