Efficient Plug-and-Play Weight Refinement for Sparse Large Models

Authors

  • Jingcheng Xie University of Science and Technology of China
  • Yinda Chen University of Science and Technology of China
  • Xiaoyu Liu University of Science and Technology of China
  • Yinglong Li University of Science and Technology of China
  • Haoyuan Shi University of Science and Technology of China
  • Zhiwei Xiong University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i32.39922

Abstract

One-shot pruning efficiently compresses Large Language Models but produces coarse sparse weights, causing significant performance degradation. Traditional fine-tuning approaches to refine these weights are prohibitively expensive for large models. This highlights the need for a training-free weight refinement method that works seamlessly with one-shot pruning and can efficiently recover the lost performance. To tackle this problem, we propose Efficient Iterative Weight Refinement (EIWR), a lightweight, plug-and-play, and training-free method that refines pruned weights through layer-wise iterative optimization. EIWR achieves efficient weight refinement via three key components: a Global Soft Constraint that eliminates costly row-wise Hessian inversions and expands the solution space; a Historical Momentum Strategy that leverages one-shot pruning priors to accelerate convergence and enhance final performance; and Neumann Series Extrapolation that significantly speeds up per-iteration computation. As a result, EIWR enables effective weight refinement with minimal time and memory overhead. Extensive experiments on LLaMA2/3 and Qwen under different pruning strategies and sparsity levels demonstrate that our method can efficiently refine sparse weights and mitigate performance degradation. For example, on LLaMA2-7B under 70% sparsity, EIWR reduces perplexity by 15% compared with SparseGPT on the WikiText2 benchmark, with only 1.81 additional minutes of computation and 1 GB of additional memory.
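The Neumann Series Extrapolation component builds on a standard identity: when the spectral radius of I − A is below 1, the inverse A⁻¹ equals the series Σₖ (I − A)ᵏ, so a truncated sum approximates the inverse using only matrix products, avoiding an explicit inversion. The sketch below illustrates only this general identity, not the paper's actual algorithm; the matrix and truncation depth are illustrative assumptions.

```python
import numpy as np

def neumann_inverse(A, num_terms=100):
    """Approximate A^{-1} by truncating the Neumann series
    sum_{k>=0} (I - A)^k. Converges when rho(I - A) < 1."""
    n = A.shape[0]
    R = np.eye(n) - A        # residual matrix; convergence needs rho(R) < 1
    approx = np.eye(n)       # k = 0 term
    term = np.eye(n)
    for _ in range(num_terms):
        term = term @ R      # (I - A)^k, accumulated by repeated products
        approx += term
    return approx

# Illustrative well-conditioned matrix with rho(I - A) < 1
A = np.array([[0.9, 0.1],
              [0.05, 0.8]])
A_inv_approx = neumann_inverse(A)
err = np.max(np.abs(A_inv_approx @ A - np.eye(2)))
```

With enough terms, `A_inv_approx @ A` approaches the identity; in practice a few terms already give a useful approximation when the residual's spectral radius is small, which is what makes series-based extrapolation cheap per iteration.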

Published

2026-03-14

How to Cite

Xie, J., Chen, Y., Liu, X., Li, Y., Shi, H., & Xiong, Z. (2026). Efficient Plug-and-Play Weight Refinement for Sparse Large Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27081–27089. https://doi.org/10.1609/aaai.v40i32.39922

Section

AAAI Technical Track on Machine Learning IX