Indirect Stochastic Gradient Quantization and Its Application in Distributed Deep Learning
Transmitting the gradients or model parameters is a critical bottleneck in distributed training of large models. To mitigate this issue, we propose an indirect quantization and compression of stochastic gradients (SG) via factorization. The gist of the idea is that, in contrast to the direct compression methods, we focus on the factors in SGs, i.e., the forward and backward signals in the backpropagation algorithm. We observe that these factors are correlated and generally sparse in most deep models. This gives rise to rethinking of the approaches for quantization and compression of gradients with the ultimate goal of minimizing the error in the final computed gradients subject to the desired communication constraints. We have proposed and theoretically analyzed different indirect SG quantization (ISGQ) methods. The proposed ISGQ reduces the reconstruction error in SGs compared to the direct quantization methods with the same number of quantization bits. Moreover, it can achieve compression gains of more than 100, while the existing traditional quantization schemes can achieve compression ratio of at most 32 (quantizing to 1 bit). Further, for a fixed total batch-size, the required transmission bit-rate per worker decreases in ISGQ as the number of workers increases.