Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v39i8.32923

Abstract
Large Vision-Language Models (LVLMs), which leverage a Large Language Model (LLM) as the cognitive core, have recently become one of the most representative multimodal model paradigms. However, as the unimodal branches, i.e., the visual encoder and the LLM, grow larger, the storage and computational burdens intensify, posing challenges for deployment. Structured pruning has proven promising for compressing large models by trimming a large portion of insignificant network structures. Nevertheless, most existing methods are designed for LLMs: they either rely on unitary importance metrics that fail to handle modality-wise imbalances, or adopt generic pruning and recovery paradigms that overlook the unique calibration status and capability requirements of large models, leading to substantial performance degradation. To address these issues, we propose a novel structured pruning approach for LVLMs, dubbed Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling (UKMP). Specifically, we design a Unified Knowledge Maintenance Importance (UKMI) metric that simultaneously balances block-wise and modality-wise importance via adaptive normalization, optimizes importance estimation by refining gradient-based criteria, and maintains the knowledge capacity of LVLMs using angle-distribution information entropy. Moreover, we develop a LoRA-based Progressive Distillation (LPD) method that recalls the pruned weights and performs progressive distillation for comprehensive recovery. Extensive experimental results across various vision-language tasks demonstrate the effectiveness of our approach compared with state-of-the-art structured pruning methods.
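To illustrate the kind of modality-wise balancing the abstract describes, here is a minimal sketch of a generic gradient-based channel-importance score with per-branch normalization. This is a hypothetical toy example, not the paper's actual UKMI metric: the saliency function, group names, and normalization scheme are all assumptions made for illustration.

```python
import numpy as np

def channel_importance(weights, grads):
    # Generic first-order (Taylor-style) saliency: |w * dL/dw| summed
    # per output channel. A stand-in for a gradient-based criterion,
    # not the paper's refined UKMI formulation.
    return np.abs(weights * grads).sum(axis=1)

def normalize_per_group(scores_by_group):
    # Rescale each branch's scores to zero mean / unit variance so that
    # branches with different weight magnitudes (e.g. a visual encoder
    # vs. an LLM) compete on a comparable footing in a global ranking.
    return {
        name: (s - s.mean()) / (s.std() + 1e-8)
        for name, s in scores_by_group.items()
    }

# Toy example: two "modality" branches with very different weight scales.
rng = np.random.default_rng(0)
vis_w, vis_g = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
llm_w, llm_g = 10 * rng.normal(size=(8, 16)), rng.normal(size=(8, 16))

raw = {
    "visual": channel_importance(vis_w, vis_g),
    "llm": channel_importance(llm_w, llm_g),
}
norm = normalize_per_group(raw)

# Before normalization the larger-scale branch dominates any single
# global threshold; after normalization both branches span a similar range.
print({k: (round(float(v.min()), 2), round(float(v.max()), 2))
       for k, v in norm.items()})
```

Without such per-branch normalization, pruning by a single global threshold would remove channels almost exclusively from the smaller-magnitude branch, which is one reason a unitary importance metric struggles with modality-wise imbalance.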
Published
2025-04-11
How to Cite
Wu, Z., Chen, J., & Wang, Y. (2025). Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8550-8558. https://doi.org/10.1609/aaai.v39i8.32923
Section
AAAI Technical Track on Computer Vision VII