V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer

Authors

  • Guangzhen Yao Northeast Normal University
  • Jiayun Zheng University of Michigan
  • Zezhou Wang Australian National University
  • Wenxin Zhang Northeast Normal University
  • Renda Han Northeast Normal University
  • Chuangxin Zhao Northeast Normal University
  • Zeyu Zhang Northeast Normal University
  • Runhao Liu Northeast Normal University

DOI:

https://doi.org/10.1609/aaai.v40i40.40737

Abstract

Vision Transformer (ViT) has become a cornerstone of computer vision, demonstrating exceptional performance. However, its high computational complexity and inference latency remain significant obstacles to deployment in resource-constrained environments. Token pruning, which removes less informative tokens, is an effective strategy for reducing computational overhead, but existing pruning methods largely rely on static or local token importance scores. This myopic approach overlooks the sequential nature of pruning decisions and fails to capture their interaction effects across layers, neglecting the global dependencies between mask variables. To address this limitation, we propose V-Pruner, a fast and globally-informed token pruning framework for Vision Transformer. V-Pruner first leverages Fisher information for an initial assessment of token importance, providing a principled prior for pruning decisions. Building on this, V-Pruner applies the Proximal Policy Optimization (PPO) algorithm from reinforcement learning (RL), recasting token pruning as a global sequential decision process. A composite reward signal that combines model performance and computational cost guides policy exploration, evaluating the long-term impact of different combinations of pruning decisions on global model performance. Extensive experiments on ViT-L, DeiT-B, DeiT-S, and DeiT-T demonstrate that V-Pruner achieves a better balance among accuracy, GFLOPs, inference speed, and training time, surpassing existing mainstream ViT pruning algorithms in overall performance.
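The two ingredients the abstract names can be sketched in isolation: a diagonal-Fisher importance score per token (expected squared gradient of the loss with respect to each token embedding) and a composite reward that trades accuracy against compute. This is a minimal illustrative sketch, not the paper's implementation; the linear over-budget penalty, the coefficient `lam`, and the top-k keep rule are assumptions, and the full PPO policy loop is omitted.

```python
import numpy as np


def fisher_token_scores(grads: np.ndarray) -> np.ndarray:
    """Diagonal Fisher approximation of token importance.

    grads: array of shape (num_samples, num_tokens, dim) holding
    gradients of the loss w.r.t. each token embedding. The score for
    a token is its expected squared gradient norm over the samples.
    """
    return (grads ** 2).sum(axis=-1).mean(axis=0)


def composite_reward(accuracy: float, gflops: float,
                     gflops_budget: float, lam: float = 0.5) -> float:
    """Reward combining task performance and computational cost.

    Assumed form: accuracy minus a linear penalty for exceeding the
    compute budget (the exact shape and `lam` are not from the paper).
    """
    return accuracy - lam * max(0.0, gflops - gflops_budget)


def prune_tokens(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top `keep_ratio` fraction of tokens by score.

    Returns the sorted indices of the kept tokens; in the paper this
    prior would seed a learned PPO policy rather than act alone.
    """
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:k])
```

For example, with uniform gradients every token receives the same score, so `prune_tokens` reduces to keeping an arbitrary top-k; with distinct scores it keeps exactly the highest-scoring fraction, and `composite_reward` only penalizes configurations whose GFLOPs exceed the budget.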

Published

2026-03-14

How to Cite

Yao, G., Zheng, J., Wang, Z., Zhang, W., Han, R., Zhao, C., … Liu, R. (2026). V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34396–34404. https://doi.org/10.1609/aaai.v40i40.40737

Section

AAAI Technical Track on Natural Language Processing V