Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i9.37673

Abstract
Large vision-language models (LVLMs) excel at visual understanding but face efficiency challenges due to quadratic complexity when processing long multimodal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to account for the unique multi-view characteristics of high-resolution LVLMs that use dynamic cropping. Current methods treat all tokens uniformly, yet our analysis shows that global thumbnails can naturally guide the compression of local crops by providing holistic context for evaluating informativeness. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary relationship between thumbnails and crops and the distinct characteristics across different crops. Based on these insights, we propose "Global Compression Commander" (GlobalCom2), a novel plug-and-play token compression framework for high-resolution LVLMs. GlobalCom2 uses the thumbnail as a "commander" to adaptively guide the compression of local crops, preserving informative details while removing redundancy. Extensive experiments demonstrate that GlobalCom2 maintains over 90% of model performance while compressing 90% of visual tokens, reducing FLOPs to 9.1% and peak memory usage to 60% of the original.
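The abstract describes scoring local-crop tokens against the global thumbnail to decide which to keep. The sketch below illustrates one plausible form of such thumbnail-guided token pruning; the function name, the cosine-similarity scoring rule, and the fixed keep ratio are illustrative assumptions, not the paper's actual algorithm.

```python
import torch

def thumbnail_guided_compress(thumb_tokens, crop_tokens, keep_ratio=0.1):
    """Hypothetical sketch: score each local-crop token by its best
    cosine similarity to any global-thumbnail token, then keep only
    the top-scoring fraction (e.g. 10% for 90% compression).

    thumb_tokens: (T, D) thumbnail token features
    crop_tokens:  (N, D) local-crop token features
    Returns the retained crop tokens of shape (k, D).
    """
    # Normalize so the dot product is cosine similarity
    thumb = torch.nn.functional.normalize(thumb_tokens, dim=-1)
    crop = torch.nn.functional.normalize(crop_tokens, dim=-1)

    # For each crop token, take its strongest match to the thumbnail
    # as a proxy for informativeness under the global context
    scores = (crop @ thumb.T).max(dim=-1).values  # shape (N,)

    k = max(1, int(crop_tokens.shape[0] * keep_ratio))
    # Keep the k highest-scoring tokens, preserving their spatial order
    keep = scores.topk(k).indices.sort().values
    return crop_tokens[keep]
```

With `keep_ratio=0.1`, a crop of 576 tokens would be reduced to 57, consistent with the 90% compression rate quoted in the abstract.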
Published
2026-03-14
How to Cite
Liu, X., Wang, Z., Chen, J., Han, Y., Wang, Y., Yuan, J., Song, J., Huang, S., & Chen, H. (2026). Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7350-7358. https://doi.org/10.1609/aaai.v40i9.37673
Issue
Section
AAAI Technical Track on Computer Vision VI