SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i33.40062
Abstract
Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of (1 - r/min(d, k)). To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by 2x, and delivers up to a 1.7x inference speedup.
Published
2026-03-14
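The core construction the abstract describes (magnitude-prune the frozen base weights, then recover the discarded residual with a rank-r truncated-SVD adapter) can be sketched in NumPy. This is a minimal illustration of the idea only; the function name `salr_sketch` and its defaults are assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def salr_sketch(W, sparsity=0.5, rank=8):
    """Illustrative sketch: magnitude-prune W, then fit a truncated-SVD
    low-rank adapter (A, B) to the pruning residual W - W_sparse."""
    # Magnitude-based pruning of the frozen base weights: zero out the
    # `sparsity` fraction of entries with the smallest absolute value.
    thresh = np.quantile(np.abs(W), sparsity)
    W_sparse = W * (np.abs(W) >= thresh)

    # Residual discarded by pruning.
    R = W - W_sparse

    # Truncated SVD of the residual: the rank-r pair (A, B) is the best
    # rank-r approximation of R in Frobenius norm (Eckart-Young).
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d, r)
    B = Vt[:rank]                # (r, k)

    # The served model uses W_sparse + A @ B in place of dense W.
    return W_sparse, A, B
```

By Eckart-Young, `W_sparse + A @ B` can only reduce the approximation error relative to the pruned matrix alone; the abstract's per-entry MSE reduction factor of (1 - r/min(d, k)) is the paper's bound for this recovery step.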
How to Cite
Zhang, L., Wu, S., Hou, S., Qing, Z., Zheng, Z., Ke, D., … Chu, X. (2026). SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28337–28345. https://doi.org/10.1609/aaai.v40i33.40062
Section
AAAI Technical Track on Machine Learning X