Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving

Authors

  • Changhai Zhou Fudan University
  • Yuhua Zhou Zhejiang University
  • Shiyang Zhang Columbia University
  • Yibin Wang Fudan University
  • Zekai Liu Fudan University

DOI:

https://doi.org/10.1609/aaai.v39i21.34453

Abstract

Low-Rank Adaptation (LoRA) has become increasingly popular for efficiently fine-tuning large language models (LLMs) with minimal resources. However, traditional methods that serve multiple LoRA models independently result in redundant computation and low GPU utilization. This paper addresses these inefficiencies by introducing Dynamic Operator Optimization (Dop), an automated optimization technique that dynamically tunes the Segmented Gather Matrix-Vector Multiplication (SGMV) operator for specific serving scenarios. SGMV's design enables batching GPU operations across different LoRA models, significantly improving computational efficiency. Dop employs a Search Space Constructor to build a hierarchical search space, dividing the program space into high-level structural sketches and low-level implementation details to ensure diversity and flexibility in operator implementation. An Optimization Engine then refines these implementations through evolutionary search, guided by a cost model that estimates program performance. This iterative optimization process allows SGMV implementations to adapt dynamically across scenarios while maintaining high performance. We demonstrate that Dop improves throughput by 1.30-1.46x in a state-of-the-art multi-tenant LoRA serving system.
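To make the batching idea concrete, the following is a minimal NumPy sketch of a segmented gather matrix-vector multiply, not the paper's actual operator. It assumes a batch of requests where each request selects one of several stacked LoRA adapters; requests are gathered into per-adapter segments so each segment's low-rank update runs as one batched matmul rather than a separate launch per request. All names (`sgmv`, `A_stack`, `B_stack`, `seg_ids`) are illustrative.

```python
import numpy as np

def sgmv(x, A_stack, B_stack, seg_ids):
    """Naive SGMV-style computation.

    x:        (n, d) input activations, one row per request
    A_stack:  (m, d, r) down-projection weights for m LoRA adapters
    B_stack:  (m, r, d) up-projection weights for m LoRA adapters
    seg_ids:  (n,) adapter index assigned to each request
    Returns:  (n, d) LoRA contributions, x_i @ A[seg_ids[i]] @ B[seg_ids[i]]
    """
    out = np.empty_like(x)
    for a in np.unique(seg_ids):
        idx = np.where(seg_ids == a)[0]      # gather the segment for adapter a
        out[idx] = x[idx] @ A_stack[a] @ B_stack[a]  # one batched matmul per segment
    return out
```

The point of the segmented formulation is that the cost scales with the number of distinct adapters in the batch, not the number of requests; a real GPU kernel would fuse the gather and both matmuls, which is precisely the implementation space Dop searches over.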

Published

2025-04-11

How to Cite

Zhou, C., Zhou, Y., Zhang, S., Wang, Y., & Liu, Z. (2025). Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22910–22918. https://doi.org/10.1609/aaai.v39i21.34453

Section

AAAI Technical Track on Machine Learning VII