V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer
DOI:
https://doi.org/10.1609/aaai.v40i40.40737
Abstract
Vision Transformer (ViT) has become one of the cornerstones of computer vision, demonstrating exceptional performance. However, its high computational complexity and inference latency still pose significant obstacles to deployment in resource-constrained environments. Token pruning, which removes less informative tokens, offers an effective strategy for reducing computational overhead, yet existing pruning methods largely rely on static or local token importance scores. This myopic approach overlooks the sequential dependency of pruning decisions and fails to capture how decisions made at different layers interact, neglecting the global coupling between mask variables. To address this limitation, we propose V-Pruner, a fast and globally-informed token pruning framework for Vision Transformer. V-Pruner first leverages Fisher information to perform an initial assessment of token importance, providing a principled prior for pruning decisions. Building on this, V-Pruner applies reinforcement learning with Proximal Policy Optimization (PPO), recasting token pruning as a global sequential decision process. A composite reward signal that combines model performance and computational cost guides policy exploration, allowing the algorithm to evaluate the long-term impact of different combinations of pruning decisions on global model performance. Extensive experiments on ViT-L, DeiT-B, DeiT-S, and DeiT-T demonstrate that V-Pruner achieves a better balance among accuracy, GFLOPs, inference speed, and training time, surpassing existing mainstream ViT pruning algorithms in overall performance.
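The abstract names three ingredients: a diagonal Fisher-information estimate of token importance, top-k pruning from that prior, and a composite reward trading accuracy against compute. The paper's exact formulas are not given here, so the sketch below is a hypothetical NumPy illustration of those three ideas only: `fisher_token_importance`, `prune_mask`, and `composite_reward` are illustrative names, and the reward weight `lam` is an assumed hyperparameter.

```python
import numpy as np

def fisher_token_importance(grads):
    """Diagonal Fisher approximation of per-token importance.

    grads: array of shape (num_samples, num_tokens, dim) holding
    gradients of the loss w.r.t. each token embedding. The diagonal
    Fisher reduces to the expected squared gradient, summed over the
    embedding dimension and averaged over samples.
    """
    return (grads ** 2).sum(axis=-1).mean(axis=0)  # shape: (num_tokens,)

def prune_mask(scores, keep_ratio):
    """Keep the top-k tokens ranked by Fisher score (the initial prior)."""
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[::-1][:k]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask

def composite_reward(accuracy, flops_used, flops_full, lam=0.5):
    """Composite reward: model performance minus a penalty on
    retained compute, as a signal for PPO policy exploration.
    `lam` balances accuracy against the FLOPs ratio (assumed form)."""
    return accuracy - lam * (flops_used / flops_full)
```

In the full method this prior would only initialize the decision process; the PPO policy would then refine the per-layer masks sequentially, using `composite_reward` as the episode return.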
Published
2026-03-14
How to Cite
Yao, G., Zheng, J., Wang, Z., Zhang, W., Han, R., Zhao, C., … Liu, R. (2026). V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34396–34404. https://doi.org/10.1609/aaai.v40i40.40737
Issue
Section
AAAI Technical Track on Natural Language Processing V