Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs

Authors

  • Xuan Ding, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Rui Sun, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Yunjian Zhang, University of Chinese Academy of Sciences
  • Xiu Yan, Meituan Group
  • Yueqi Zhou, Beijing Normal University
  • Kaihao Huang, Beijing Normal University
  • Suzhong Fu, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Angelica I Aviles-Rivero, Tsinghua University
  • Chuanlong Xie, Beijing Normal University
  • Yao Zhu, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v40i25.39222

Abstract

Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation due to the indiscriminate removal of entire Transformer layers. This paper reveals "Patch-Like" redundancy across layers via correlation analysis of the outputs of different layers in reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, this paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, when pruning 35% of the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect.
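The abstract's pipeline, measuring inter-layer output similarity in an RKHS and then merging runs of similar consecutive layers, can be illustrated with a short sketch. The snippet below is a minimal, illustrative reading only: it assumes linear CKA as the RKHS-based similarity measure and uniform parameter averaging as the consolidation rule, since the abstract does not specify the paper's exact kernel or merging formula. The helpers `cka` and `average_parameters` and the `threshold` default are assumptions for exposition, not the authors' implementation.

```python
import torch

def cka(x, y):
    """Linear CKA between two layer-output matrices of shape (n_tokens, d).
    Assumed here as the RKHS similarity measure; the paper may use a
    different kernel."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic_xy = torch.norm(y.T @ x) ** 2
    hsic_xx = torch.norm(x.T @ x)
    hsic_yy = torch.norm(y.T @ y)
    return (hsic_xy / (hsic_xx * hsic_yy)).item()

def average_parameters(group):
    """Consolidate identically shaped Transformer blocks by uniform
    parameter averaging (one possible consolidation rule)."""
    base = group[0]
    with torch.no_grad():
        for name, p in base.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in group])
            p.copy_(stacked.mean(dim=0))
    return base

def sliding_window_merge(layers, hidden_states, threshold=0.95):
    """Greedily group consecutive layers whose calibration outputs stay
    above the similarity threshold, then merge each group into one layer.
    Pass `layers` (and their cached `hidden_states`) in top-to-bottom
    order to match the selection direction described in the abstract."""
    merged, i = [], 0
    while i < len(layers):
        j = i + 1
        # Grow the window while the next layer's output remains similar
        # to the output of the window's first layer.
        while j < len(layers) and cka(hidden_states[i], hidden_states[j]) >= threshold:
            j += 1
        merged.append(average_parameters(layers[i:j]))
        i = j
    return merged
```

In this reading, the threshold directly controls the compression ratio: a lower threshold admits more layers into each window and yields a shallower merged model, while a threshold near 1 merges only near-duplicate layers.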

Published

2026-03-14

How to Cite

Ding, X., Sun, R., Zhang, Y., Yan, X., Zhou, Y., Huang, K., … Zhu, Y. (2026). Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20826–20834. https://doi.org/10.1609/aaai.v40i25.39222

Section

AAAI Technical Track on Machine Learning II