EnViT: Enhancing the Performance of Early-Exit Vision Transformers via Exit-Aware Structured Dropout-Enabled Self-Distillation

Authors

  • Yonghao Dong, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Qiang He, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; Swinburne University of Technology, Melbourne, Australia
  • Penghong Rui, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Zhenzhe Zheng, Shanghai Jiao Tong University, Shanghai, China
  • Zhao Li, Zhejiang Lab, Hangzhou, China
  • Feifei Chen, Deakin University, Melbourne, Australia
  • Hai Jin, National Engineering Research Center for Big Data Technology and System, Wuhan, China; Services Computing Technology and System Lab, Wuhan, China; Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
  • Yun Yang, Swinburne University of Technology, Melbourne, Australia

DOI:

https://doi.org/10.1609/aaai.v40i25.39225

Abstract

Vision Transformers (ViTs) have gained significant attention and widespread adoption due to their impressive performance on various computer vision tasks. In practice, however, their substantial computational cost often leads to high inference latency when they are deployed on resource-constrained edge devices such as smartphones, autonomous vehicles, and robots. To address this challenge, Early Exit (EE) has emerged as a promising approach to lightweight inference on edge devices: it accelerates inference and reduces computation by adaptively producing predictions at early exits based on sample complexity. Existing EE methods, however, typically suffer substantial accuracy drops at late exits while offering only marginal accuracy gains at early exits. This paper presents EnViT, an exit-aware structured dropout-enabled self-distillation approach that enhances the performance of early exits without compromising late exits. EnViT leverages structured dropout to enable self-distillation, with the full model serving as the teacher and its own virtual sub-models, generated by structured dropout, serving as students. This mechanism effectively distills knowledge from the full model to the early exits and avoids performance degradation at late exits by mitigating parameter conflicts across exits during training. Evaluation on five datasets shows that EnViT achieves accuracy improvements ranging from 0.36% to 7.92% while maintaining competitive speed-up ratios of 1.72x to 2.23x.
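To make the core idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of structured dropout-enabled self-distillation in an early-exit transformer. It is not the authors' implementation (which also involves exit-aware mechanisms beyond this sketch): `ToyEarlyExitViT`, `self_distill_step`, and all hyperparameters here are illustrative assumptions. A full pass through all blocks acts as the teacher, and a weight-sharing "virtual sub-model", obtained by randomly dropping whole blocks (structured dropout), acts as the student whose exits are distilled toward the teacher's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEarlyExitViT(nn.Module):
    """Toy transformer with a classification head ('exit') after every block.
    Hypothetical sketch, not the paper's architecture."""

    def __init__(self, dim=32, depth=4, num_classes=10, nhead=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, dim_feedforward=dim * 2,
                                       batch_first=True)
            for _ in range(depth)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes)
                                   for _ in range(depth))

    def forward(self, x, keep=None):
        """keep: optional boolean list; False skips a block (structured
        dropout), yielding a virtual sub-model that shares all weights
        with the full model. Returns one logit tensor per exit."""
        logits = []
        for i, (blk, head) in enumerate(zip(self.blocks, self.exits)):
            if keep is None or keep[i]:
                x = blk(x)
            logits.append(head(x.mean(dim=1)))  # pool tokens, classify here
        return logits


def self_distill_step(model, x, y, drop_p=0.5, tau=2.0):
    """One training step: the full model teaches its own dropout sub-model."""
    with torch.no_grad():                      # teacher = full model
        teacher_logits = model(x)
    # Sample a virtual sub-model: randomly drop blocks, always keep the first.
    keep = [True] + [torch.rand(1).item() > drop_p for _ in model.blocks[1:]]
    student_logits = model(x, keep=keep)       # student = sub-model
    loss = torch.tensor(0.0)
    for t, s in zip(teacher_logits, student_logits):
        loss = loss + F.cross_entropy(s, y)    # supervised loss at every exit
        loss = loss + (tau ** 2) * F.kl_div(   # distill teacher -> student
            F.log_softmax(s / tau, dim=-1),
            F.softmax(t / tau, dim=-1),
            reduction="batchmean",
        )
    return loss


model = ToyEarlyExitViT()
x = torch.randn(8, 16, 32)           # batch of 8, 16 tokens, dim 32
y = torch.randint(0, 10, (8,))
loss = self_distill_step(model, x, y)
loss.backward()                       # gradients flow through shared weights
```

Because the student shares every parameter with the teacher, the distillation gradients update the same weights the full model uses, which is what makes this *self*-distillation rather than a separate student network.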

Published

2026-03-14

How to Cite

Dong, Y., He, Q., Rui, P., Zheng, Z., Li, Z., Chen, F., … Yang, Y. (2026). EnViT: Enhancing the Performance of Early-Exit Vision Transformers via Exit-Aware Structured Dropout-Enabled Self-Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20852–20860. https://doi.org/10.1609/aaai.v40i25.39225

Section

AAAI Technical Track on Machine Learning II