FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Authors

  • Zishan Shao, Department of Statistical Science and Department of Electrical & Computer Engineering, Duke University
  • Yixiao Wang, Department of Statistical Science, Duke University
  • Qinsi Wang, Department of Electrical & Computer Engineering, Duke University
  • Ting Jiang, Department of Electrical & Computer Engineering, Duke University
  • Zhixu Du, Department of Electrical & Computer Engineering, Duke University
  • Hancheng Ye, Department of Electrical & Computer Engineering, Duke University
  • Danyang Zhuo, Department of Computer Science, Duke University
  • Yiran Chen, Department of Electrical & Computer Engineering, Duke University
  • Hai "Helen" Li, Department of Electrical & Computer Engineering, Duke University

DOI:

https://doi.org/10.1609/aaai.v40i30.39720

Abstract

Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20-80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial, and ultimately unnecessary, activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, can in the worst case prevent any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into the self-attention and feed-forward pipelines. This design avoids materializing large activation buffers: small tiles of the truncated factors are streamed through on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy without adding latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.
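
To make the activation-memory argument concrete, the following is a minimal PyTorch sketch, not the authors' fused CUDA kernels. It contrasts a naive SVD-factorized feed-forward layer, which materializes the full (seq_len × d_ffn) intermediate activation, with a tiled variant that streams column blocks of the low-rank factor and reduces on the fly. All function names, shapes, the rank, and the tile size are illustrative assumptions.

```python
import torch

# Assumed setup: an FFN up-projection W1 (d_model x d_ffn) has been
# SVD-truncated to rank r, so W1 ~= U @ V with U (d_model x r), V (r x d_ffn).

def lowrank_ffn_naive(x, U, V, W2):
    """Materializes the full (seq_len, d_ffn) activation, as dense kernels do."""
    h = torch.relu((x @ U) @ V)   # peak transient buffer: seq_len x d_ffn
    return h @ W2

def lowrank_ffn_streamed(x, U, V, W2, tile=256):
    """Tiles the hidden dimension so only a (seq_len, tile) slice is live."""
    xu = x @ U                    # small rank-space activation: seq_len x r
    out = torch.zeros(x.shape[0], W2.shape[1], dtype=x.dtype, device=x.device)
    for j in range(0, V.shape[1], tile):
        h_tile = torch.relu(xu @ V[:, j:j + tile])   # seq_len x tile
        out += h_tile @ W2[j:j + tile, :]            # reduce, then drop the tile
    return out

# Quick equivalence check under illustrative, BERT-Base-like shapes.
x  = torch.randn(512, 768, dtype=torch.double)
U  = torch.randn(768, 64, dtype=torch.double)
V  = torch.randn(64, 3072, dtype=torch.double)
W2 = torch.randn(3072, 768, dtype=torch.double)
assert torch.allclose(lowrank_ffn_naive(x, U, V, W2),
                      lowrank_ffn_streamed(x, U, V, W2))
```

In FlashSVD itself this tiling happens inside fused GPU kernels using on-chip SRAM; the Python loop above only mirrors the memory behavior, trading the full intermediate buffer for rank- and tile-sized scratch space.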

Published

2026-03-14

How to Cite

Shao, Z., Wang, Y., Wang, Q., Jiang, T., Du, Z., Ye, H., … "Helen" Li, H. (2026). FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25278–25285. https://doi.org/10.1609/aaai.v40i30.39720

Section

AAAI Technical Track on Machine Learning VII