SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i33.40062
Abstract
Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of (1 - r/min(d, k)). To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by 2x, and delivers up to a 1.7x inference speedup.
Published
2026-03-14
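The core construction the abstract describes (magnitude-prune the frozen base weights, then recover the discarded residual with a rank-r truncated-SVD adapter) can be sketched in NumPy. This is a minimal illustration of the idea only; the function name `salr_sketch` and its defaults are assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def salr_sketch(W, sparsity=0.5, rank=8):
    """Illustrative sketch: magnitude-prune W, then fit a truncated-SVD
    low-rank adapter (A, B) to the pruning residual W - W_sparse."""
    # Magnitude-based pruning of the frozen base weights: zero out the
    # `sparsity` fraction of entries with the smallest absolute value.
    thresh = np.quantile(np.abs(W), sparsity)
    W_sparse = W * (np.abs(W) >= thresh)

    # Residual discarded by pruning.
    R = W - W_sparse

    # Truncated SVD of the residual: the rank-r pair (A, B) is the best
    # rank-r approximation of R in Frobenius norm (Eckart-Young).
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d, r)
    B = Vt[:rank]                # (r, k)

    # The served model uses W_sparse + A @ B in place of dense W.
    return W_sparse, A, B
```

By Eckart-Young, `W_sparse + A @ B` can only reduce the approximation error relative to the pruned matrix alone; the abstract's per-entry MSE reduction factor of (1 - r/min(d, k)) is the paper's bound for this recovery step.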
How to Cite
Zhang, L., Wu, S., Hou, S., Qing, Z., Zheng, Z., Ke, D., … Chu, X. (2026). SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28337–28345. https://doi.org/10.1609/aaai.v40i33.40062
Section
AAAI Technical Track on Machine Learning X