Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs

Authors

  • Xuan Ding, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Rui Sun, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Yunjian Zhang, University of Chinese Academy of Sciences
  • Xiu Yan, Meituan Group
  • Yueqi Zhou, Beijing Normal University
  • Kaihao Huang, Beijing Normal University
  • Suzhong Fu, Shenzhen Future Network of Intelligence Institute and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong (Shenzhen); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
  • Angelica I Aviles-Rivero, Tsinghua University
  • Chuanlong Xie, Beijing Normal University
  • Yao Zhu, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v40i25.39222

Abstract

Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation due to the indiscriminate removal of entire Transformer layers. This paper reveals "Patch-Like" redundancy across layers via correlation analysis of the outputs of different layers in reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, this paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, when pruning 35% of the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect.
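The abstract's pipeline, measuring inter-layer output similarity in an RKHS and then merging runs of similar consecutive layers, can be illustrated with a short sketch. The snippet below is a minimal, illustrative reading only: it assumes linear CKA as the RKHS-based similarity measure and uniform parameter averaging as the consolidation rule, since the abstract does not specify the paper's exact kernel or merging formula. The helpers `cka` and `average_parameters` and the `threshold` default are assumptions for exposition, not the authors' implementation.

```python
import torch

def cka(x, y):
    """Linear CKA between two layer-output matrices of shape (n_tokens, d).
    Assumed here as the RKHS similarity measure; the paper may use a
    different kernel."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic_xy = torch.norm(y.T @ x) ** 2
    hsic_xx = torch.norm(x.T @ x)
    hsic_yy = torch.norm(y.T @ y)
    return (hsic_xy / (hsic_xx * hsic_yy)).item()

def average_parameters(group):
    """Consolidate identically shaped Transformer blocks by uniform
    parameter averaging (one possible consolidation rule)."""
    base = group[0]
    with torch.no_grad():
        for name, p in base.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in group])
            p.copy_(stacked.mean(dim=0))
    return base

def sliding_window_merge(layers, hidden_states, threshold=0.95):
    """Greedily group consecutive layers whose calibration outputs stay
    above the similarity threshold, then merge each group into one layer.
    Pass `layers` (and their cached `hidden_states`) in top-to-bottom
    order to match the selection direction described in the abstract."""
    merged, i = [], 0
    while i < len(layers):
        j = i + 1
        # Grow the window while the next layer's output remains similar
        # to the output of the window's first layer.
        while j < len(layers) and cka(hidden_states[i], hidden_states[j]) >= threshold:
            j += 1
        merged.append(average_parameters(layers[i:j]))
        i = j
    return merged
```

In this reading, the threshold directly controls the compression ratio: a lower threshold admits more layers into each window and yields a shallower merged model, while a threshold near 1 merges only near-duplicate layers.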

Published

2026-03-14

How to Cite

Ding, X., Sun, R., Zhang, Y., Yan, X., Zhou, Y., Huang, K., … Zhu, Y. (2026). Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20826–20834. https://doi.org/10.1609/aaai.v40i25.39222

Section

AAAI Technical Track on Machine Learning II