V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer

Authors

  • Guangzhen Yao Northeast Normal University
  • Jiayun Zheng University of Michigan
  • Zezhou Wang Australian National University
  • Wenxin Zhang Northeast Normal University
  • Renda Han Northeast Normal University
  • Chuangxin Zhao Northeast Normal University
  • Zeyu Zhang Northeast Normal University
  • Runhao Liu Northeast Normal University

DOI:

https://doi.org/10.1609/aaai.v40i40.40737

Abstract

Vision Transformer (ViT) has become a cornerstone of computer vision, demonstrating exceptional performance. However, its high computational complexity and inference latency remain significant obstacles to deployment in resource-constrained environments. Token pruning, which removes less informative tokens, is an effective strategy for reducing computational overhead, but existing pruning methods largely rely on static or local token importance scores. This myopic approach overlooks the sequential nature of pruning decisions and fails to capture their interaction effects across layers, neglecting the global dependencies between mask variables. To address this limitation, we propose V-Pruner, a fast and globally-informed token pruning framework for Vision Transformer. V-Pruner first leverages Fisher information for an initial assessment of token importance, providing a principled prior for pruning decisions. Building on this, V-Pruner applies the Proximal Policy Optimization (PPO) algorithm from reinforcement learning (RL), recasting token pruning as a global sequential decision process. A composite reward signal that combines model performance and computational cost guides policy exploration, evaluating the long-term impact of different combinations of pruning decisions on global model performance. Extensive experiments on ViT-L, DeiT-B, DeiT-S, and DeiT-T demonstrate that V-Pruner achieves a better balance among accuracy, GFLOPs, inference speed, and training time, surpassing existing mainstream ViT pruning algorithms in overall performance.
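The two ingredients the abstract names can be sketched in isolation: a diagonal-Fisher importance score per token (expected squared gradient of the loss with respect to each token embedding) and a composite reward that trades accuracy against compute. This is a minimal illustrative sketch, not the paper's implementation; the linear over-budget penalty, the coefficient `lam`, and the top-k keep rule are assumptions, and the full PPO policy loop is omitted.

```python
import numpy as np


def fisher_token_scores(grads: np.ndarray) -> np.ndarray:
    """Diagonal Fisher approximation of token importance.

    grads: array of shape (num_samples, num_tokens, dim) holding
    gradients of the loss w.r.t. each token embedding. The score for
    a token is its expected squared gradient norm over the samples.
    """
    return (grads ** 2).sum(axis=-1).mean(axis=0)


def composite_reward(accuracy: float, gflops: float,
                     gflops_budget: float, lam: float = 0.5) -> float:
    """Reward combining task performance and computational cost.

    Assumed form: accuracy minus a linear penalty for exceeding the
    compute budget (the exact shape and `lam` are not from the paper).
    """
    return accuracy - lam * max(0.0, gflops - gflops_budget)


def prune_tokens(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top `keep_ratio` fraction of tokens by score.

    Returns the sorted indices of the kept tokens; in the paper this
    prior would seed a learned PPO policy rather than act alone.
    """
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:k])
```

For example, with uniform gradients every token receives the same score, so `prune_tokens` reduces to keeping an arbitrary top-k; with distinct scores it keeps exactly the highest-scoring fraction, and `composite_reward` only penalizes configurations whose GFLOPs exceed the budget.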

Published

2026-03-14

How to Cite

Yao, G., Zheng, J., Wang, Z., Zhang, W., Han, R., Zhao, C., … Liu, R. (2026). V-Pruner: A Fast and Globally-informed Token Pruning Framework for Vision Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34396–34404. https://doi.org/10.1609/aaai.v40i40.40737

Section

AAAI Technical Track on Natural Language Processing V