Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation


  • Linrui Gong East China Normal University, China
  • Shaohui Lin East China Normal University, China
  • Baochang Zhang Beihang University, China
  • Yunhang Shen Tencent Youtu Lab, China
  • Ke Li Tencent Youtu Lab, China
  • Ruizhi Qiao Tencent Youtu Lab, China
  • Bo Ren Tencent Youtu Lab, China
  • Muqing Li Tencent Youtu Lab, China
  • Zhou Yu East China Normal University, China; Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, China
  • Lizhuang Ma East China Normal University, China



ML: Learning on the Edge & Model Compression, ML: Auto ML and Hyperparameter Tuning, ML: Ensemble Methods


Online Knowledge Distillation (OKD) is designed for settings where a high-capacity pre-trained teacher model is not available. However, existing methods mostly focus on improving the ensemble prediction accuracy of multiple students (a.k.a. branches) and often overlook the homogenization problem, which makes the student models saturate quickly and hurts performance. We hypothesize that the intrinsic bottleneck behind homogenization is the identical branch architecture and the coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by monotonically increasing the depth of the branches, starting from the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module that recursively creates hierarchical teacher assistants, treating the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, importance scores are effectively and adaptively allocated to the different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at
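
The recursive teacher-assistant scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fusion module or the loss, so the softmax-normalized blend of logits, the KL-based distillation terms, the temperature, and all function and parameter names below (`fuse`, `hierarchical_distillation_loss`, `fusion_scores`) are assumptions chosen only to make the recursion concrete.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fuse(assistant_logits, branch_logits, scores):
    """Blend two logit vectors using softmax-normalized importance scores.

    ASSUMPTION: in a real model the scores would come from a learned module;
    here they are plain numbers supplied by the caller.
    """
    w_assistant, w_branch = softmax(scores)
    return [w_assistant * a + w_branch * b
            for a, b in zip(assistant_logits, branch_logits)]


def hierarchical_distillation_loss(branch_logits, fusion_scores, temperature=3.0):
    """Sum distillation terms over the hierarchy.

    branch_logits[0] is the target branch; deeper branches follow in order.
    fusion_scores[k] is a hypothetical pair of raw scores for hierarchy k + 1.
    Returns the total loss and the logits of the final teacher assistant.
    """
    assistant = branch_logits[0]  # target branch acts as the smallest assistant
    loss = 0.0
    for branch, scores in zip(branch_logits[1:], fusion_scores):
        next_assistant = fuse(assistant, branch, scores)
        student = softmax(assistant, temperature)
        # The previous assistant is distilled by the new assistant and by the
        # branch of the current hierarchy (ASSUMED here to be two KL terms).
        loss += kl_divergence(softmax(next_assistant, temperature), student)
        loss += kl_divergence(softmax(branch, temperature), student)
        assistant = next_assistant
    return loss, assistant
```

In this sketch, identical branches yield zero distillation loss, while branches that disagree produce a positive loss that pulls each smaller assistant toward the fused prediction of the next hierarchy.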




How to Cite

Gong, L., Lin, S., Zhang, B., Shen, Y., Li, K., Qiao, R., Ren, B., Li, M., Yu, Z., & Ma, L. (2023). Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6), 7731-7739.



AAAI Technical Track on Machine Learning I