QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

Authors

  • Qirui Zhou (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Yuanbo Wen (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
  • Ruizhi Chen (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Ke Gao (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Weiqiang Xiong (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Ling Li (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Qi Guo (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
  • Yanjun Wu (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Yunji Chen (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Institute of AI for Industries, CAS, China)

DOI:

https://doi.org/10.1609/aaai.v39i21.34461

Abstract

As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) to fully exploit ever-evolving hardware architectures (e.g., NVIDIA GPUs and RISC-V CPUs) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge lies in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code. In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for optimization combinations for GEMM. The key to QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Building on these, a search strategy over combinations of meta-prompts iteratively generates high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113×. Even when compared to human experts, our method reaches 115% of cuBLAS's performance on NVIDIA GPUs and 211% of OpenBLAS's on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240×.
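To make "optimization combinations for GEMM" concrete, the sketch below contrasts a naive triple-loop GEMM with a loop-tiled variant, one classic transformation from the kind of optimization space the abstract describes. This is an illustrative example, not code from the paper or generated by QiMeng-GEMM; real targets would be C or CUDA kernels, and Python is used here only for clarity.

```python
def gemm_naive(A, B, n):
    """Reference GEMM: C = A @ B for n x n matrices (lists of lists)."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def gemm_tiled(A, B, n, T=4):
    """Loop-tiled GEMM: traverse T x T blocks to improve cache locality.

    The arithmetic is identical to the naive version; only the loop
    traversal order changes, which is exactly the kind of transformation
    an optimization search would enable or disable.
    """
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for k0 in range(0, n, T):
            for j0 in range(0, n, T):
                for i in range(i0, min(i0 + T, n)):
                    for k in range(k0, min(k0 + T, n)):
                        a = A[i][k]  # hoist the reused operand out of the j loop
                        for j in range(j0, min(j0 + T, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both versions compute the same product; an auto-tuner (or, in the paper's setting, an LLM guided by meta-prompts) would additionally choose the tile size `T` and combine tiling with vectorization, unrolling, and platform-specific intrinsics.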

Published

2025-04-11

How to Cite

Zhou, Q., Wen, Y., Chen, R., Gao, K., Xiong, W., Li, L., … Chen, Y. (2025). QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22982–22990. https://doi.org/10.1609/aaai.v39i21.34461

Section

AAAI Technical Track on Machine Learning VII