Layerwise Sparse Coding for Pruned Deep Neural Networks with Extreme Compression Ratio
Deep neural network compression is increasingly important, especially in resource-constrained environments such as autonomous drones and wearable devices. The number of weights in a trained deep model can be reduced substantially with a widely used model compression technique such as pruning. Two kinds of data are then usually stored for the compressed model: the non-zero weights and the meta-data, where the meta-data is used to encode and decode the non-zero weights. Although pruning can yield an ideally small number of non-zero weights, existing sparse matrix coding methods still require a much larger amount of meta-data (possibly several times larger than the non-zero weights), which becomes a severe bottleneck when deploying very deep models. To tackle this issue, we propose a layerwise sparse coding (LSC) method that maximizes the compression ratio by drastically reducing the amount of meta-data. We first divide a sparse matrix into multiple small blocks and remove the zero blocks, and then propose a novel signed relative index (SRI) algorithm to encode the remaining non-zero blocks with much less meta-data. In addition, the proposed LSC performs parallel matrix multiplication without full decoding, which traditional methods cannot do. Through extensive experiments, we demonstrate that LSC achieves substantial gains over state-of-the-art baselines in pruned DNN compression (e.g., a 51.03x compression ratio on ADMM-Lenet) and in inference computation (i.e., reduced time and far lower memory bandwidth).
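The first step described above, dividing a sparse matrix into small blocks and removing the zero blocks, can be illustrated with a minimal sketch. This is not the LSC implementation and does not reproduce the SRI algorithm: the per-block flag list used here to record which blocks were kept is a simple stand-in assumed for illustration, and the function names, the square block shape, and the example matrix are all hypothetical. The sketch also multiplies by a vector directly from the kept blocks, sequentially rather than in parallel, only to show that zero blocks can be skipped without rebuilding the dense matrix.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_and_drop_zero_blocks(weights: Matrix, block: int) -> Tuple[List[int], List[Matrix]]:
    """Partition `weights` into block x block tiles and keep only non-zero tiles.

    Returns (flags, kept). flags[i] is 1 if the i-th tile in row-major order
    holds at least one non-zero entry; kept lists those tiles in the same order.
    The flag list is an assumed placeholder for meta-data, not the SRI encoding.
    """
    rows, cols = len(weights), len(weights[0])
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be multiples of the block size")
    flags: List[int] = []
    kept: List[Matrix] = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            tile = [row[c:c + block] for row in weights[r:r + block]]
            nonzero = any(v != 0 for row in tile for v in row)
            flags.append(1 if nonzero else 0)
            if nonzero:
                kept.append(tile)
    return flags, kept


def restore(flags: List[int], kept: List[Matrix], rows: int, cols: int, block: int) -> Matrix:
    """Rebuild the dense matrix from the flags and kept tiles (lossless check)."""
    out = [[0.0] * cols for _ in range(rows)]
    tiles = iter(kept)
    blocks_per_row = cols // block
    for i, flag in enumerate(flags):
        if not flag:
            continue
        tile = next(tiles)
        r0 = (i // blocks_per_row) * block
        c0 = (i % blocks_per_row) * block
        for dr in range(block):
            out[r0 + dr][c0:c0 + block] = tile[dr]
    return out


def block_matvec(flags: List[int], kept: List[Matrix], x: List[float],
                 rows: int, cols: int, block: int) -> List[float]:
    """Compute W @ x from the kept tiles only, skipping every zero tile."""
    y = [0.0] * rows
    tiles = iter(kept)
    blocks_per_row = cols // block
    for i, flag in enumerate(flags):
        if not flag:
            continue
        tile = next(tiles)
        r0 = (i // blocks_per_row) * block
        c0 = (i % blocks_per_row) * block
        for dr in range(block):
            for dc in range(block):
                y[r0 + dr] += tile[dr][dc] * x[c0 + dc]
    return y


# Hypothetical 4x4 pruned weight matrix, split into 2x2 blocks.
W: Matrix = [
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.2, 0.3, 0.0, 0.0],
]
flags, kept = split_and_drop_zero_blocks(W, block=2)
x = [1.0, 2.0, 3.0, 4.0]
y = block_matvec(flags, kept, x, rows=4, cols=4, block=2)
```

In this toy case two of the four blocks are entirely zero, so only two blocks are stored and only those two take part in the multiplication. How much meta-data is needed to locate the kept blocks is exactly what the SRI algorithm addresses, and the flag list above makes no claim to match it.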