Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

Authors

  • Linrui Gong East China Normal University, China
  • Shaohui Lin East China Normal University, China
  • Baochang Zhang Beihang University, China
  • Yunhang Shen Tencent Youtu Lab, China
  • Ke Li Tencent Youtu Lab, China
  • Ruizhi Qiao Tencent Youtu Lab, China
  • Bo Ren Tencent Youtu Lab, China
  • Muqing Li Tencent Youtu Lab, China
  • Zhou Yu East China Normal University, China; Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, China
  • Lizhuang Ma East China Normal University, China

DOI:

https://doi.org/10.1609/aaai.v37i6.25937

Keywords:

ML: Learning on the Edge & Model Compression, ML: Auto ML and Hyperparameter Tuning, ML: Ensemble Methods

Abstract

Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that a high-capacity pre-trained teacher model is not available. However, existing methods mostly focus on improving the ensemble prediction accuracy from multiple students (a.k.a. branches) and often overlook the homogenization problem, which makes student models saturate quickly and hurts performance. We assume that the intrinsic bottleneck of the homogenization problem comes from the identical branch architectures and the coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by monotonically increasing the depth of branches on the basis of the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module that recursively creates hierarchical teacher assistants, regarding the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, importance scores are effectively and adaptively allocated to different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.
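As a reading aid, the following is a minimal, hedged sketch of the recursive teacher-assistant fusion and distillation loop described in the abstract. All names (kd_loss, ahbf_losses, fusion_weights) and the particular form of the adaptive weighting (a sigmoid-gated convex combination of logits) are illustrative assumptions, not the paper's exact formulation; the official implementation is at https://github.com/linruigong965/AHBF.

```python
# Hedged sketch, assuming PyTorch and logit-level fusion; details differ in the paper.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=3.0):
    """Standard KL-divergence distillation loss on temperature-softened logits."""
    p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)  # stop-gradient on the teacher side (a design choice here)
    return F.kl_div(p_s, p_t, reduction="batchmean") * (T * T)


def ahbf_losses(branch_logits, labels, fusion_weights, T=3.0):
    """branch_logits: list ordered from the simplest (target) branch to the deepest branch.
    fusion_weights: learnable scalars playing the role of the adaptive importance scores
    (the scalar-per-hierarchy form is an assumption). Returns the total training loss."""
    # Every branch is supervised by the ground-truth labels.
    total = sum(F.cross_entropy(z, labels) for z in branch_logits)

    # The target branch is treated as the smallest teacher assistant.
    ta_logits = branch_logits[0]
    for i, deeper in enumerate(branch_logits[1:]):
        # Adaptively fuse the previous teacher assistant with the current,
        # deeper branch to form the next-hierarchy teacher assistant.
        w = torch.sigmoid(fusion_weights[i])
        new_ta = w * ta_logits + (1.0 - w) * deeper
        # The previous-hierarchy assistant is distilled by both the current
        # branch and the newly fused assistant, as the abstract describes.
        total = total + kd_loss(ta_logits, deeper, T) + kd_loss(ta_logits, new_ta, T)
        ta_logits = new_ta
    return total
```

Under these assumptions, knowledge flows from the deepest branch down to the simplest target branch through the recursively built assistants, and the learnable fusion weights allocate importance across branches.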

Published

2023-06-26

How to Cite

Gong, L., Lin, S., Zhang, B., Shen, Y., Li, K., Qiao, R., Ren, B., Li, M., Yu, Z., & Ma, L. (2023). Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7731-7739. https://doi.org/10.1609/aaai.v37i6.25937

Issue

Vol. 37 No. 6 (2023)

Section

AAAI Technical Track on Machine Learning I