Abstract
The recent many-fold increase in the size of deep neural networks makes
efficient distributed training challenging. Many proposals exploit the
compressibility of the gradients and propose lossy compression techniques to
speed up the communication stage of distributed training. Nevertheless,
compression comes at the cost of reduced model quality and extra computation
overhead. In this work, we design an efficient compressor with minimal
overhead. Noting the sparsity of the gradients, we propose to model the
gradients as random variables distributed according to some sparsity-inducing
distributions (SIDs). We empirically validate our assumption by studying the
statistical characteristics of the evolution of gradient vectors over the
training process. We then propose Sparsity-Inducing Distribution-based
Compression (SIDCo), a threshold-based sparsification scheme that enjoys
similar threshold estimation quality to deep gradient compression (DGC) while
being faster by imposing lower compression overhead. Our extensive evaluation
of popular machine learning benchmarks involving both recurrent neural network
(RNN) and convolution neural network (CNN) models shows that SIDCo speeds up
training by up to 41:7%, 7:6%, and 1:9% compared to the no-compression
baseline, Topk, and DGC compressors, respectively.