Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i9.37673

Abstract
Large vision-language models (LVLMs) excel at visual understanding but face efficiency challenges due to quadratic complexity when processing long multimodal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to account for the unique multi-view characteristics of high-resolution LVLMs that use dynamic cropping. Current methods treat all tokens uniformly, yet our analysis shows that global thumbnails can naturally guide the compression of local crops by providing holistic context for evaluating informativeness. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary relationship between thumbnails and crops and the distinct characteristics across different crops. Based on these insights, we propose "Global Compression Commander" (GlobalCom2), a novel plug-and-play token compression framework for high-resolution LVLMs. GlobalCom2 uses the thumbnail as a "commander" to adaptively guide the compression of local crops, preserving informative details while removing redundancy. Extensive experiments demonstrate that GlobalCom2 maintains over 90% of model performance while compressing 90% of visual tokens, reducing FLOPs to 9.1% and peak memory usage to 60% of the original.
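The abstract describes scoring local-crop tokens against the global thumbnail to decide which to keep. The sketch below illustrates one plausible form of such thumbnail-guided token pruning; the function name, the cosine-similarity scoring rule, and the fixed keep ratio are illustrative assumptions, not the paper's actual algorithm.

```python
import torch

def thumbnail_guided_compress(thumb_tokens, crop_tokens, keep_ratio=0.1):
    """Hypothetical sketch: score each local-crop token by its best
    cosine similarity to any global-thumbnail token, then keep only
    the top-scoring fraction (e.g. 10% for 90% compression).

    thumb_tokens: (T, D) thumbnail token features
    crop_tokens:  (N, D) local-crop token features
    Returns the retained crop tokens of shape (k, D).
    """
    # Normalize so the dot product is cosine similarity
    thumb = torch.nn.functional.normalize(thumb_tokens, dim=-1)
    crop = torch.nn.functional.normalize(crop_tokens, dim=-1)

    # For each crop token, take its strongest match to the thumbnail
    # as a proxy for informativeness under the global context
    scores = (crop @ thumb.T).max(dim=-1).values  # shape (N,)

    k = max(1, int(crop_tokens.shape[0] * keep_ratio))
    # Keep the k highest-scoring tokens, preserving their spatial order
    keep = scores.topk(k).indices.sort().values
    return crop_tokens[keep]
```

With `keep_ratio=0.1`, a crop of 576 tokens would be reduced to 57, consistent with the 90% compression rate quoted in the abstract.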
Published
2026-03-14
How to Cite
Liu, X., Wang, Z., Chen, J., Han, Y., Wang, Y., Yuan, J., Song, J., Huang, S., & Chen, H. (2026). Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7350-7358. https://doi.org/10.1609/aaai.v40i9.37673
Issue
Section
AAAI Technical Track on Computer Vision VI