TOP-RL: Task-Optimized Progressive Token Pruning with Reinforcement Learning for Vision Language Models

Authors

  • Hengyi Wang, State Key Laboratory of Integrated Services Networks, Xidian University
  • Weiying Xie, State Key Laboratory of Integrated Services Networks, Xidian University
  • Hui Jiang, State Key Laboratory of Integrated Services Networks, Xidian University
  • Yaotao Wei, Beijing Institute of Technology
  • Kai Jiang, State Key Laboratory of Integrated Services Networks, Xidian University
  • Mingxiang Cao, State Key Laboratory of Integrated Services Networks, Xidian University
  • Chenhe Hao, State Key Laboratory of Integrated Services Networks, Xidian University
  • Leyuan Fang, College of Electric and Information Engineering, Hunan University

DOI:

https://doi.org/10.1609/aaai.v40i18.38614

Abstract

In recent years, Large Vision-Language Models (LVLMs) have significantly advanced multimodal tasks. However, their inference requires intensive processing of numerous visual tokens and incurs substantial computational overhead. Existing methods typically compress visual tokens either at the input stage or in early model layers, ignoring variations across tasks and depths. To address these limitations, we introduce TOP-RL, a Task-Optimized Progressive token pruning framework based on Reinforcement Learning. TOP-RL formulates visual token pruning as a multi-stage Markov Decision Process (MDP). It employs an agent trained with dense and fine-grained reward signals to progressively generate differentiable binary masks. This enables TOP-RL to adaptively select crucial visual tokens tailored to each task, effectively balancing accuracy and computational efficiency. Extensive experiments on leading multimodal datasets and advanced LVLMs validate that TOP-RL effectively learns task-optimized pruning policies, significantly boosting inference efficiency while preserving robust performance. For instance, LLaVA-NeXT equipped with TOP-RL achieves a 1.9x speedup in inference time and a 9.3x reduction in FLOPs, with 96% performance preserved.
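
To make the idea of progressive, multi-stage masking concrete, the following is a minimal toy sketch and not the authors' implementation. The abstract does not specify the agent, the reward, or the pruning schedule, so everything here is an assumption made for illustration: the names `progressive_prune` and `toy_reward`, the per-stage keep ratios, and the cost weight are hypothetical. The sketch also uses a hard top-k selection over given scores in place of a learned policy, so the masks are not differentiable and no reinforcement learning takes place.

```python
import random


def progressive_prune(scores_per_stage, keep_ratios):
    """Apply one binary keep-mask per stage; a dropped token stays dropped.

    scores_per_stage: one list of per-token scores per pruning stage
        (hypothetical stand-in for what a pruning agent might output).
    keep_ratios: fraction of the surviving tokens kept at each stage (assumed).
    Returns the cumulative binary mask after each stage.
    """
    num_tokens = len(scores_per_stage[0])
    mask = [1] * num_tokens  # all visual tokens start as kept
    masks = []
    for scores, ratio in zip(scores_per_stage, keep_ratios):
        # Only tokens that survived earlier stages are candidates.
        alive = [i for i in range(num_tokens) if mask[i] == 1]
        num_keep = max(1, int(len(alive) * ratio))
        ranked = sorted(alive, key=lambda i: scores[i], reverse=True)
        keep = set(ranked[:num_keep])
        mask = [1 if i in keep else 0 for i in range(num_tokens)]
        masks.append(mask)
    return masks


def toy_reward(task_score, masks, cost_weight=0.5):
    """Assumed reward shape: task score minus a penalty on tokens kept.

    The penalty is the fraction of tokens kept, averaged over stages.
    """
    kept_fraction = sum(sum(m) / len(m) for m in masks) / len(masks)
    return task_score - cost_weight * kept_fraction


if __name__ == "__main__":
    rng = random.Random(0)
    # Three pruning stages over 16 visual tokens with random scores.
    scores = [[rng.random() for _ in range(16)] for _ in range(3)]
    masks = progressive_prune(scores, keep_ratios=[0.5, 0.5, 0.5])
    print("tokens kept per stage:", [sum(m) for m in masks])
    print("toy reward:", toy_reward(task_score=0.9, masks=masks))
```

The sketch only shows the structure the abstract describes: pruning decisions are made at several depths rather than once at the input, later masks are subsets of earlier ones, and a scalar signal trades task quality against the number of tokens retained. A trained agent would replace the fixed scores and keep ratios.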

Published

2026-03-14

How to Cite

Wang, H., Xie, W., Jiang, H., Wei, Y., Jiang, K., Cao, M., … Fang, L. (2026). TOP-RL: Task-Optimized Progressive Token Pruning with Reinforcement Learning for Vision Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(18), 15824–15832. https://doi.org/10.1609/aaai.v40i18.38614

Section

AAAI Technical Track on Data Mining & Knowledge Management II