Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving

Authors

  • Changhai Zhou Fudan University
  • Yuhua Zhou Zhejiang University
  • Shiyang Zhang Columbia University
  • Yibin Wang Fudan University
  • Zekai Liu Fudan University

DOI:

https://doi.org/10.1609/aaai.v39i21.34453

Abstract

Low-Rank Adaptation (LoRA) has become increasingly popular for efficiently fine-tuning large language models (LLMs) with minimal resources. However, traditional methods that serve multiple LoRA models independently result in redundant computation and low GPU utilization. This paper addresses these inefficiencies by introducing Dynamic Operator Optimization (Dop), an automated optimization technique that dynamically tunes the Segmented Gather Matrix-Vector Multiplication (SGMV) operator for specific serving scenarios. SGMV's design enables batching GPU operations across different LoRA models, significantly improving computational efficiency. Dop employs a Search Space Constructor to build a hierarchical search space, dividing the program space into high-level structural sketches and low-level implementation details to ensure diversity and flexibility in operator implementation. An Optimization Engine then refines these implementations through evolutionary search, guided by a cost model that estimates program performance. This iterative optimization process allows SGMV implementations to adapt dynamically across scenarios while maintaining high performance. We demonstrate that Dop improves throughput by 1.30-1.46x in a state-of-the-art multi-tenant LoRA serving system.
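To make the batching idea concrete, the following is a minimal NumPy sketch of a segmented gather matrix-vector multiply, not the paper's actual operator. It assumes a batch of requests where each request selects one of several stacked LoRA adapters; requests are gathered into per-adapter segments so each segment's low-rank update runs as one batched matmul rather than a separate launch per request. All names (`sgmv`, `A_stack`, `B_stack`, `seg_ids`) are illustrative.

```python
import numpy as np

def sgmv(x, A_stack, B_stack, seg_ids):
    """Naive SGMV-style computation.

    x:        (n, d) input activations, one row per request
    A_stack:  (m, d, r) down-projection weights for m LoRA adapters
    B_stack:  (m, r, d) up-projection weights for m LoRA adapters
    seg_ids:  (n,) adapter index assigned to each request
    Returns:  (n, d) LoRA contributions, x_i @ A[seg_ids[i]] @ B[seg_ids[i]]
    """
    out = np.empty_like(x)
    for a in np.unique(seg_ids):
        idx = np.where(seg_ids == a)[0]      # gather the segment for adapter a
        out[idx] = x[idx] @ A_stack[a] @ B_stack[a]  # one batched matmul per segment
    return out
```

The point of the segmented formulation is that the cost scales with the number of distinct adapters in the batch, not the number of requests; a real GPU kernel would fuse the gather and both matmuls, which is precisely the implementation space Dop searches over.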

Published

2025-04-11

How to Cite

Zhou, C., Zhou, Y., Zhang, S., Wang, Y., & Liu, Z. (2025). Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22910–22918. https://doi.org/10.1609/aaai.v39i21.34453

Section

AAAI Technical Track on Machine Learning VII