SEAP: Sparse Expert Activation Pruning Unlocks the Brainpower of Large Language Models

Xun Liang; Hanyu Wang; Huayi Lai; Simin Niu; Shichao Song; Jiawei Yang; Jihao Zhao; Feiyu Xiong; Bo Tang; Zhiyu Li

doi:10.1609/aaai.v40i38.40463

Authors

Xun Liang, Renmin University of China
Hanyu Wang, Renmin University of China
Huayi Lai, Renmin University of China
Simin Niu, Renmin University of China
Shichao Song, Renmin University of China
Jiawei Yang, Renmin University of China
Jihao Zhao, Renmin University of China
Feiyu Xiong, Institute for Advanced Algorithms Research (Shanghai); MemTensor (Shanghai) Technology Co., Ltd.
Bo Tang, Institute for Advanced Algorithms Research (Shanghai); MemTensor (Shanghai) Technology Co., Ltd.
Zhiyu Li, Institute for Advanced Algorithms Research (Shanghai); MemTensor (Shanghai) Technology Co., Ltd.

DOI: https://doi.org/10.1609/aaai.v40i38.40463

Abstract

Pruning is a promising approach to reduce the high inference cost of large language models (LLMs), but it often comes at the expense of performance. Motivated by the "functional localization" theory in neuroscience, we hypothesize that LLMs contain task-specific expert activation paths, where specific subsets of neurons are co-activated for particular tasks. This structure allows selective activation to preserve task performance while improving inference efficiency. We introduce Sparse Expert Activation Pruning (SEAP), a training-free pruning method for large language models. SEAP identifies task-relevant activation paths by analyzing the clustering patterns of hidden states and neuron activations on a multi-task calibration dataset. Cross-task transfer evaluations confirm the existence of such expert activation structures. SEAP constructs task-aware pruning masks by leveraging a task-expert calibration dataset, which provides representative samples across diverse tasks to reveal their activation signatures. It then employs a lightweight task router to dynamically select relevant computation paths based on the input task. This design significantly reduces inference cost without compromising accuracy. Experimental results show that SEAP retains model performance with only a 1.5% drop on most tasks at 20% sparsity, and at 50% sparsity, it surpasses strong pruning baselines such as WandA and FLAP by over 20%. These results highlight SEAP as a scalable and effective solution for efficient LLM inference.
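
The pipeline described in the abstract (per-task calibration, task-aware masks, a lightweight router) can be illustrated with a toy sketch. This is not the authors' implementation: the scoring rule (mean absolute activation), the nearest-centroid router, and every name below (`neuron_scores`, `build_mask`, `centroid`, `route`, `apply_mask`) are assumptions chosen for illustration, and the mask here zeroes activations rather than removing weights from a real model.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = calibration samples, columns = neurons


def neuron_scores(activations: Matrix) -> List[float]:
    """Assumed importance score: mean absolute activation per neuron."""
    n = len(activations)
    width = len(activations[0])
    return [sum(abs(row[j]) for row in activations) / n for j in range(width)]


def build_mask(scores: List[float], sparsity: float) -> List[bool]:
    """Keep the highest-scoring (1 - sparsity) fraction of neurons."""
    keep = max(1, int(len(scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = set(ranked[:keep])
    return [j in kept for j in range(len(scores))]


def centroid(activations: Matrix) -> List[float]:
    """Mean vector of a task's calibration samples (stand-in for hidden states)."""
    n = len(activations)
    return [sum(row[j] for row in activations) / n
            for j in range(len(activations[0]))]


def route(hidden: List[float], centroids: Dict[str, List[float]]) -> str:
    """Hypothetical router: pick the task whose centroid is nearest."""
    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda task: sq_dist(hidden, centroids[task]))


def apply_mask(activation: List[float], mask: List[bool]) -> List[float]:
    """Zero out pruned neurons; a real system would drop the weights instead."""
    return [a if keep else 0.0 for a, keep in zip(activation, mask)]


# Made-up calibration activations for two tasks over four neurons.
calib: Dict[str, Matrix] = {
    "math": [[0.9, 0.1, 0.8, 0.0], [1.1, 0.0, 0.7, 0.1]],
    "code": [[0.1, 0.9, 0.0, 0.8], [0.0, 1.2, 0.1, 0.9]],
}

masks = {task: build_mask(neuron_scores(acts), sparsity=0.5)
         for task, acts in calib.items()}
centroids = {task: centroid(acts) for task, acts in calib.items()}

query = [0.8, 0.2, 0.9, 0.0]
task = route(query, centroids)
pruned = apply_mask(query, masks[task])
```

In this toy setup the two tasks end up with disjoint masks, so the router's choice decides which half of the neurons participate for a given input. The masks are computed once from calibration data and no weights are updated, which is the sense in which such a scheme is training-free.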
