FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Authors

  • Zishan Shao, Department of Statistical Science and Department of Electrical & Computer Engineering, Duke University
  • Yixiao Wang, Department of Statistical Science, Duke University
  • Qinsi Wang, Department of Electrical & Computer Engineering, Duke University
  • Ting Jiang, Department of Electrical & Computer Engineering, Duke University
  • Zhixu Du, Department of Electrical & Computer Engineering, Duke University
  • Hancheng Ye, Department of Electrical & Computer Engineering, Duke University
  • Danyang Zhuo, Department of Computer Science, Duke University
  • Yiran Chen, Department of Electrical & Computer Engineering, Duke University
  • Hai "Helen" Li, Department of Electrical & Computer Engineering, Duke University

DOI:

https://doi.org/10.1609/aaai.v40i30.39720

Abstract

Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20-80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial, and ultimately unnecessary, activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, can in the worst case prevent any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into the self-attention and feed-forward pipelines. This design avoids materializing large activation buffers: small tiles of the truncated factors are streamed through on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy without adding latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.
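
To make the activation-memory argument concrete, the following is a minimal PyTorch sketch, not the authors' fused CUDA kernels. It contrasts a naive SVD-factorized feed-forward layer, which materializes the full (seq_len × d_ffn) intermediate activation, with a tiled variant that streams column blocks of the low-rank factor and reduces on the fly. All function names, shapes, the rank, and the tile size are illustrative assumptions.

```python
import torch

# Assumed setup: an FFN up-projection W1 (d_model x d_ffn) has been
# SVD-truncated to rank r, so W1 ~= U @ V with U (d_model x r), V (r x d_ffn).

def lowrank_ffn_naive(x, U, V, W2):
    """Materializes the full (seq_len, d_ffn) activation, as dense kernels do."""
    h = torch.relu((x @ U) @ V)   # peak transient buffer: seq_len x d_ffn
    return h @ W2

def lowrank_ffn_streamed(x, U, V, W2, tile=256):
    """Tiles the hidden dimension so only a (seq_len, tile) slice is live."""
    xu = x @ U                    # small rank-space activation: seq_len x r
    out = torch.zeros(x.shape[0], W2.shape[1], dtype=x.dtype, device=x.device)
    for j in range(0, V.shape[1], tile):
        h_tile = torch.relu(xu @ V[:, j:j + tile])   # seq_len x tile
        out += h_tile @ W2[j:j + tile, :]            # reduce, then drop the tile
    return out

# Quick equivalence check under illustrative, BERT-Base-like shapes.
x  = torch.randn(512, 768, dtype=torch.double)
U  = torch.randn(768, 64, dtype=torch.double)
V  = torch.randn(64, 3072, dtype=torch.double)
W2 = torch.randn(3072, 768, dtype=torch.double)
assert torch.allclose(lowrank_ffn_naive(x, U, V, W2),
                      lowrank_ffn_streamed(x, U, V, W2))
```

In FlashSVD itself this tiling happens inside fused GPU kernels using on-chip SRAM; the Python loop above only mirrors the memory behavior, trading the full intermediate buffer for rank- and tile-sized scratch space.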

Published

2026-03-14

How to Cite

Shao, Z., Wang, Y., Wang, Q., Jiang, T., Du, Z., Ye, H., … "Helen" Li, H. (2026). FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25278–25285. https://doi.org/10.1609/aaai.v40i30.39720

Section

AAAI Technical Track on Machine Learning VII