Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v39i8.32923

Abstract
Large Vision-Language Models (LVLMs), which leverage a Large Language Model (LLM) as the cognitive core, have recently become one of the most representative multimodal model paradigms. However, as the unimodal branches, i.e., the visual encoder and the LLM, grow larger, the storage and computational burdens intensify, posing challenges for deployment. Structured pruning has proven promising for compressing large models by trimming a large portion of insignificant network structures. Nevertheless, most existing methods are designed for LLMs: they either rely on unitary importance metrics that fail to handle modality-wise imbalances, or adopt generic pruning and recovery paradigms that overlook the unique calibration status and capability requirements of large models, leading to substantial performance degradation. To address these issues, we propose a novel structured pruning approach for LVLMs, dubbed Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling (UKMP). Specifically, we design a Unified Knowledge Maintenance Importance (UKMI) metric that simultaneously balances block-wise and modality-wise importance via adaptive normalization, optimizes importance estimation by refining gradient-based criteria, and maintains the knowledge capacity of LVLMs using angle-distribution information entropy. Moreover, we develop a LoRA-based Progressive Distillation (LPD) method that recalls the pruned weights and performs progressive distillation for comprehensive recovery. Extensive experimental results across various vision-language tasks demonstrate the effectiveness of our approach compared with state-of-the-art structured pruning methods.
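To illustrate the kind of modality-wise balancing the abstract describes, here is a minimal sketch of a generic gradient-based channel-importance score with per-branch normalization. This is a hypothetical toy example, not the paper's actual UKMI metric: the saliency function, group names, and normalization scheme are all assumptions made for illustration.

```python
import numpy as np

def channel_importance(weights, grads):
    # Generic first-order (Taylor-style) saliency: |w * dL/dw| summed
    # per output channel. A stand-in for a gradient-based criterion,
    # not the paper's refined UKMI formulation.
    return np.abs(weights * grads).sum(axis=1)

def normalize_per_group(scores_by_group):
    # Rescale each branch's scores to zero mean / unit variance so that
    # branches with different weight magnitudes (e.g. a visual encoder
    # vs. an LLM) compete on a comparable footing in a global ranking.
    return {
        name: (s - s.mean()) / (s.std() + 1e-8)
        for name, s in scores_by_group.items()
    }

# Toy example: two "modality" branches with very different weight scales.
rng = np.random.default_rng(0)
vis_w, vis_g = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
llm_w, llm_g = 10 * rng.normal(size=(8, 16)), rng.normal(size=(8, 16))

raw = {
    "visual": channel_importance(vis_w, vis_g),
    "llm": channel_importance(llm_w, llm_g),
}
norm = normalize_per_group(raw)

# Before normalization the larger-scale branch dominates any single
# global threshold; after normalization both branches span a similar range.
print({k: (round(float(v.min()), 2), round(float(v.max()), 2))
       for k, v in norm.items()})
```

Without such per-branch normalization, pruning by a single global threshold would remove channels almost exclusively from the smaller-magnitude branch, which is one reason a unitary importance metric struggles with modality-wise imbalance.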
Published
2025-04-11
How to Cite
Wu, Z., Chen, J., & Wang, Y. (2025). Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8550-8558. https://doi.org/10.1609/aaai.v39i8.32923
Section
AAAI Technical Track on Computer Vision VII