QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

Authors

  • Qirui Zhou (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Yuanbo Wen (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
  • Ruizhi Chen (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Ke Gao (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Weiqiang Xiong (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Ling Li (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China)
  • Qi Guo (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
  • Yanjun Wu (Intelligent Software Research Center, Institute of Software, CAS, Beijing, China)
  • Yunji Chen (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Institute of AI for Industries, CAS, China)

DOI:

https://doi.org/10.1609/aaai.v39i21.34461

Abstract

As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) to fully exploit ever-evolving hardware architectures (e.g., NVIDIA GPUs and RISC-V CPUs) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge lies in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code. In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for optimization combinations for GEMM. The key to QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Building on these, a search strategy over combinations of meta-prompts iteratively generates high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113×. Even when compared to human experts, our method reaches 115% of cuBLAS's performance on NVIDIA GPUs and 211% of OpenBLAS's on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240×.
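To make "optimization combinations for GEMM" concrete, the sketch below contrasts a naive triple-loop GEMM with a loop-tiled variant, one classic transformation from the kind of optimization space the abstract describes. This is an illustrative example, not code from the paper or generated by QiMeng-GEMM; real targets would be C or CUDA kernels, and Python is used here only for clarity.

```python
def gemm_naive(A, B, n):
    """Reference GEMM: C = A @ B for n x n matrices (lists of lists)."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def gemm_tiled(A, B, n, T=4):
    """Loop-tiled GEMM: traverse T x T blocks to improve cache locality.

    The arithmetic is identical to the naive version; only the loop
    traversal order changes, which is exactly the kind of transformation
    an optimization search would enable or disable.
    """
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for k0 in range(0, n, T):
            for j0 in range(0, n, T):
                for i in range(i0, min(i0 + T, n)):
                    for k in range(k0, min(k0 + T, n)):
                        a = A[i][k]  # hoist the reused operand out of the j loop
                        for j in range(j0, min(j0 + T, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both versions compute the same product; an auto-tuner (or, in the paper's setting, an LLM guided by meta-prompts) would additionally choose the tile size `T` and combine tiling with vectorization, unrolling, and platform-specific intrinsics.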

Published

2025-04-11

How to Cite

Zhou, Q., Wen, Y., Chen, R., Gao, K., Xiong, W., Li, L., … Chen, Y. (2025). QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22982–22990. https://doi.org/10.1609/aaai.v39i21.34461

Section

AAAI Technical Track on Machine Learning VII