FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
DOI:
https://doi.org/10.1609/aaai.v40i30.39720
Abstract
Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20-80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial, and ultimately unnecessary, activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, in the worst case prevents any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into self-attention and feed-forward pipelines. This design avoids materializing large activation buffers by streaming small tiles of truncated factors through on-chip SRAM, performing on-the-fly multiplication and reduction, and immediately evicting results, thus preserving high GPU occupancy without introducing latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.
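The streaming idea described in the abstract, computing a low-rank projection tile by tile so the full intermediate activation is never materialized, can be illustrated with a minimal NumPy sketch. This is a hypothetical host-side analogue, not the paper's fused CUDA kernels: the function name, the row-tiling scheme, and the `tile_rows` parameter are illustrative assumptions, and the actual FlashSVD kernels perform the tiling in on-chip SRAM.

```python
import numpy as np

def lowrank_matmul_streamed(X, U, V, tile_rows=64):
    """Compute Y = (X @ U) @ V one row tile at a time.

    A weight W (d_in x d_out) is approximated by truncated SVD factors
    U (d_in x r) and V (r x d_out). Streaming row tiles keeps the
    transient rank-space buffer at tile_rows x r instead of n x r:
    each tile is multiplied, reduced into the output, and evicted.
    (Illustrative sketch only; real kernels fuse this on-chip.)
    """
    n = X.shape[0]
    Y = np.empty((n, V.shape[1]), dtype=X.dtype)
    for i in range(0, n, tile_rows):
        tile = X[i:i + tile_rows] @ U   # small (tile_rows x r) buffer
        Y[i:i + tile_rows] = tile @ V   # reduce and evict immediately
    return Y
```

The result matches the unstreamed `(X @ U) @ V` up to floating-point rounding; only the size of the transient buffer changes.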
Published
2026-03-14
How to Cite
Shao, Z., Wang, Y., Wang, Q., Jiang, T., Du, Z., Ye, H., … "Helen" Li, H. (2026). FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25278–25285. https://doi.org/10.1609/aaai.v40i30.39720
Issue
Section
AAAI Technical Track on Machine Learning VII