HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models (Student Abstract)

Authors

  • Jizhihui Liu, Harbin Institute of Technology, Shenzhen
  • Guangdao Zhu, Harbin Institute of Technology, Shenzhen
  • Feiyi Du, Harbin Institute of Technology, Shenzhen

DOI:

https://doi.org/10.1609/aaai.v40i48.42240

Abstract

Vision-Language Models (VLMs) encode images into long sequences of visual tokens, which incurs substantial computational overhead and limits inference efficiency. In this paper, we study the hierarchical attention pattern in vision encoders and propose HiPrune, a training-free and model-agnostic token pruning framework for VLMs. We find that middle layers in the vision encoder attend to object-centric regions, while deep layers capture global contextual features. Building on this observation, HiPrune selects tokens according to their attention scores in the middle and deep layers. Our method requires no retraining and integrates with any ViT-based VLM. Experiments show that HiPrune achieves strong pruning performance while maintaining a balance between efficiency and efficacy.
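
The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two layers' scores are combined, so the function name `select_tokens`, the `mid_fraction` parameter, the even split of the token budget, and the top-k rule are all assumptions made for illustration. The inputs are assumed to be one attention score per visual token, taken from a middle layer and from a deep layer of the vision encoder.

```python
def select_tokens(mid_scores, deep_scores, budget, mid_fraction=0.5):
    """Return sorted indices of the visual tokens to keep.

    mid_scores / deep_scores: one attention score per visual token, from a
    middle and a deep vision-encoder layer (how they are extracted is left
    to the caller). The budget split and top-k rule are illustrative
    assumptions, not details taken from the paper.
    """
    if len(mid_scores) != len(deep_scores):
        raise ValueError("score lists must have the same length")
    n = len(mid_scores)
    budget = min(budget, n)
    n_mid = int(budget * mid_fraction)

    # Tokens ranked highest by the middle layer (object-centric regions).
    by_mid = sorted(range(n), key=lambda i: mid_scores[i], reverse=True)
    keep = set(by_mid[:n_mid])

    # Fill the remaining budget with tokens ranked highest by the deep
    # layer (global context), skipping tokens that were already kept.
    by_deep = sorted(range(n), key=lambda i: deep_scores[i], reverse=True)
    for i in by_deep:
        if len(keep) >= budget:
            break
        keep.add(i)

    return sorted(keep)


# Toy example: 6 visual tokens, keep 4 of them.
mid = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3]
deep = [0.1, 0.7, 0.2, 0.05, 0.6, 0.3]
kept = select_tokens(mid, deep, budget=4)
print(kept)
```

Because a procedure of this kind only reads attention scores that the encoder already computes and then drops the unselected tokens, it needs no gradient updates, which is consistent with the training-free claim above.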

Published

2026-03-14

How to Cite

Liu, J., Zhu, G., & Du, F. (2026). HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41275–41277. https://doi.org/10.1609/aaai.v40i48.42240