Harmonized Dense Knowledge Distillation Training for Multi-Exit Architectures
Keywords: Classification and Regression, Optimization, Representation Learning
Abstract
Multi-exit architectures, in which a sequence of intermediate classifiers is attached at different depths of the feature layers, perform adaptive computation by letting "easy" samples exit early, which speeds up inference. In this paper, a novel Harmonized Dense Knowledge Distillation (HDKD) training method for multi-exit architectures is designed to encourage each exit to flexibly learn from all of its later exits. In particular, a general dense knowledge distillation training objective is proposed to incorporate all possible beneficial supervision information for multi-exit learning, and a harmonized weighting scheme is designed for the resulting multi-objective optimization problem, which consists of the multi-exit classification loss and the dense distillation loss. A bilevel optimization algorithm is introduced for alternately updating the weights of the multiple objectives and the multi-exit network parameters. Specifically, the loss weighting parameters are optimized by gradient descent with respect to performance on a validation set. Experiments on CIFAR100 and ImageNet show that the HDKD strategy harmoniously improves the performance of state-of-the-art multi-exit neural networks. Moreover, the method requires no modifications within the architecture and can be effectively combined with other previously proposed training techniques to further boost performance.
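The dense distillation objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the use of KL divergence between temperature-softened outputs, the function names (`dense_kd_loss` and its helpers), the `temperature` parameter, and the fixed weight lists `ce_weights` and `kd_weights`. In HDKD these weights are learned on a validation set through bilevel optimization, which this sketch does not attempt to reproduce.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss of one exit against the ground-truth label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, 1e-12))
               for pi, qi in zip(p, q) if pi > 0.0)


def dense_kd_loss(exit_logits, label, ce_weights, kd_weights, temperature=3.0):
    """Weighted sum of per-exit classification losses and dense distillation terms.

    exit_logits[i] holds the logits of exit i, ordered from shallowest to deepest.
    kd_weights[i][j] weights the term in which exit i (student) learns from a
    later exit j > i (teacher); entries with j <= i are ignored.
    All weights are fixed inputs here (assumption); HDKD learns them.
    """
    num_exits = len(exit_logits)
    loss = 0.0
    for i in range(num_exits):
        loss += ce_weights[i] * cross_entropy(exit_logits[i], label)
        student = softmax(exit_logits[i], temperature)
        for j in range(i + 1, num_exits):
            # In a real training loop the teacher output would typically be
            # treated as a constant (no gradient flows into exit j).
            teacher = softmax(exit_logits[j], temperature)
            loss += kd_weights[i][j] * kl_divergence(teacher, student)
    return loss


# Toy example: three exits, three classes, ground-truth class 0.
logits = [[1.0, 0.5, 0.1], [2.0, 0.3, -0.5], [3.0, 0.0, -1.0]]
ce_w = [1.0, 1.0, 1.0]
kd_w = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
total_loss = dense_kd_loss(logits, 0, ce_w, kd_w)
```

With three exits the dense scheme produces three distillation terms (exit 1 from exits 2 and 3, exit 2 from exit 3), in contrast to schemes in which every exit learns only from the final one. Setting every entry of `kd_w` to zero reduces the objective to plain multi-exit classification training.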
How to Cite
Wang, X., & Li, Y. (2021). Harmonized Dense Knowledge Distillation Training for Multi-Exit Architectures. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11), 10218-10226. https://doi.org/10.1609/aaai.v35i11.17225
AAAI Technical Track on Machine Learning IV