D2 Prune: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
DOI:
https://doi.org/10.1609/aaai.v40i32.39932
Abstract
Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimates; and (2) they overlook the long-tail distribution of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, D²Prune. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, which leads to more accurate pruning mask selection and weight updates and helps minimize error during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that D²Prune consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
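The second idea in the abstract, jointly minimizing the KL divergence of attention distributions and the reconstruction error, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the single-head attention setup, the magnitude-based mask used as a stand-in pruner, the helper names (attention_probs, magnitude_mask, joint_objective), and the weighting factor lam are all assumptions made for illustration. The sketch only evaluates such a combined objective for a dense layer and a pruned copy; it does not perform the dynamic weight update described above.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_probs(x, w_q, w_k):
    # Single-head scaled dot-product attention probabilities (assumed setup).
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores)


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) per query row, averaged over rows.
    per_row = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_row))


def magnitude_mask(w, sparsity):
    # Stand-in pruner (NOT the paper's criterion): drop the smallest-magnitude
    # fraction `sparsity` of the entries and keep the rest.
    k = int(w.size * sparsity)
    if k == 0:
        return np.ones(w.shape, dtype=bool)
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > threshold


def joint_objective(x, w_q, w_k, w_q_pruned, w_k_pruned, lam=1.0):
    # Hypothetical combined objective: attention KL + lam * reconstruction MSE.
    p_dense = attention_probs(x, w_q, w_k)
    p_pruned = attention_probs(x, w_q_pruned, w_k_pruned)
    kl = kl_divergence(p_dense, p_pruned)
    recon = float(
        np.mean((x @ w_q - x @ w_q_pruned) ** 2)
        + np.mean((x @ w_k - x @ w_k_pruned) ** 2)
    )
    return kl + lam * recon, kl, recon


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))                   # 16 calibration tokens, dim 32
w_q = rng.normal(size=(32, 32)) / np.sqrt(32)   # dense query projection
w_k = rng.normal(size=(32, 32)) / np.sqrt(32)   # dense key projection

mask_q = magnitude_mask(w_q, 0.5)
mask_k = magnitude_mask(w_k, 0.5)
total, kl, recon = joint_objective(x, w_q, w_k, w_q * mask_q, w_k * mask_k)
print(f"KL={kl:.4f}  recon={recon:.4f}  total={total:.4f}")
```

An update strategy in the spirit of the abstract would adjust the surviving weights to lower both terms at once; reconstruction error alone can be small while the shape of the attention distribution still shifts, which is why the KL term is tracked separately in this sketch.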
Published
2026-03-14
How to Cite
Xiong, L., Liu, N., Ren, A., Bai, Y., Fang, H., Zhang, B., … Liu, D. (2026). D2 Prune: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27171–27179. https://doi.org/10.1609/aaai.v40i32.39932
Section
AAAI Technical Track on Machine Learning IX