EnViT: Enhancing the Performance of Early-Exit Vision Transformers via Exit-Aware Structured Dropout-Enabled Self-Distillation

Authors

  • Yonghao Dong, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Qiang He, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; Swinburne University of Technology, Melbourne, Australia
  • Penghong Rui, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Zhenzhe Zheng, Shanghai Jiao Tong University, Shanghai, China
  • Zhao Li, Zhejiang Lab, Hangzhou, China
  • Feifei Chen, Deakin University, Melbourne, Australia
  • Hai Jin, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Yun Yang, Swinburne University of Technology, Melbourne, Australia

DOI:

https://doi.org/10.1609/aaai.v40i25.39225

Abstract

Vision Transformers (ViTs) have gained significant attention and widespread adoption due to their impressive performance on various computer vision tasks. In practice, however, their substantial computational cost often leads to high inference latency when they are deployed on resource-constrained edge devices such as smartphones, autonomous vehicles, and robots. To address this challenge, Early Exit (EE) has emerged as a promising approach to lightweight inference on edge devices: it accelerates inference and reduces computation by adaptively producing predictions at early exits based on sample complexity. Existing EE methods, however, typically suffer substantial accuracy drops at late exits while offering only marginal accuracy gains at early exits. This paper presents EnViT, an exit-aware structured dropout-enabled self-distillation approach that enhances the performance of early exits without compromising late exits. EnViT leverages structured dropout to enable self-distillation, with the full model serving as the teacher and its own virtual sub-models, generated by structured dropout, serving as students. This mechanism effectively distills knowledge from the full model to the early exits and avoids performance degradation at late exits by mitigating parameter conflicts across exits during training. Evaluation on five datasets shows that EnViT achieves accuracy improvements ranging from 0.36% to 7.92% while maintaining competitive speed-up ratios of 1.72x to 2.23x.
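To make the core idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of structured dropout-enabled self-distillation in an early-exit transformer. It is not the authors' implementation (which also involves exit-aware mechanisms beyond this sketch): `ToyEarlyExitViT`, `self_distill_step`, and all hyperparameters here are illustrative assumptions. A full pass through all blocks acts as the teacher, and a weight-sharing "virtual sub-model", obtained by randomly dropping whole blocks (structured dropout), acts as the student whose exits are distilled toward the teacher's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEarlyExitViT(nn.Module):
    """Toy transformer with a classification head ('exit') after every block.
    Hypothetical sketch, not the paper's architecture."""

    def __init__(self, dim=32, depth=4, num_classes=10, nhead=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, dim_feedforward=dim * 2,
                                       batch_first=True)
            for _ in range(depth)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes)
                                   for _ in range(depth))

    def forward(self, x, keep=None):
        """keep: optional boolean list; False skips a block (structured
        dropout), yielding a virtual sub-model that shares all weights
        with the full model. Returns one logit tensor per exit."""
        logits = []
        for i, (blk, head) in enumerate(zip(self.blocks, self.exits)):
            if keep is None or keep[i]:
                x = blk(x)
            logits.append(head(x.mean(dim=1)))  # pool tokens, classify here
        return logits


def self_distill_step(model, x, y, drop_p=0.5, tau=2.0):
    """One training step: the full model teaches its own dropout sub-model."""
    with torch.no_grad():                      # teacher = full model
        teacher_logits = model(x)
    # Sample a virtual sub-model: randomly drop blocks, always keep the first.
    keep = [True] + [torch.rand(1).item() > drop_p for _ in model.blocks[1:]]
    student_logits = model(x, keep=keep)       # student = sub-model
    loss = torch.tensor(0.0)
    for t, s in zip(teacher_logits, student_logits):
        loss = loss + F.cross_entropy(s, y)    # supervised loss at every exit
        loss = loss + (tau ** 2) * F.kl_div(   # distill teacher -> student
            F.log_softmax(s / tau, dim=-1),
            F.softmax(t / tau, dim=-1),
            reduction="batchmean",
        )
    return loss


model = ToyEarlyExitViT()
x = torch.randn(8, 16, 32)           # batch of 8, 16 tokens, dim 32
y = torch.randint(0, 10, (8,))
loss = self_distill_step(model, x, y)
loss.backward()                       # gradients flow through shared weights
```

Because the student shares every parameter with the teacher, the distillation gradients update the same weights the full model uses, which is what makes this *self*-distillation rather than a separate student network.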

Published

2026-03-14

How to Cite

Dong, Y., He, Q., Rui, P., Zheng, Z., Li, Z., Chen, F., … Yang, Y. (2026). EnViT: Enhancing the Performance of Early-Exit Vision Transformers via Exit-Aware Structured Dropout-Enabled Self-Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20852–20860. https://doi.org/10.1609/aaai.v40i25.39225

Section

AAAI Technical Track on Machine Learning II