Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration

Authors

  • Ziyang Ma, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
  • Guanrou Yang, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
  • Yifan Yang, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
  • Zhifu Gao, Alibaba Group
  • Jiaming Wang, Alibaba Group
  • Zhihao Du, Alibaba Group
  • Fan Yu, Alibaba Group
  • Qian Chen, Alibaba Group
  • Siqi Zheng, Alibaba Group
  • Shiliang Zhang, Alibaba Group
  • Xie Chen, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i23.34666

Abstract

In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Despite the growing body of research in this area, we find that many crucial design decisions in LLM-based ASR systems are often inadequately justified. This lack of clarity impedes the field's progress, making it difficult to pinpoint which design choices truly improve model performance. To address these challenges, we conduct a comprehensive series of experiments that explore various aspects of system design, leading to the optimal LLM-based ASR system. We find that delicate designs are not necessary, and that a clean setup with little task-specific design is sufficient. The resulting models achieve strong performance on the LibriSpeech and GigaSpeech datasets compared to both LLM-based and non-LLM-based models. Finally, we explore the emergence of capability in LLM-based ASR during the process of modal alignment. We hope that our study facilitates research on extending LLMs with cross-modal capacity and sheds light on open questions for the LLM-based ASR community.

Published

2025-04-11

How to Cite

Ma, Z., Yang, G., Yang, Y., Gao, Z., Wang, J., Du, Z., … Chen, X. (2025). Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24840–24848. https://doi.org/10.1609/aaai.v39i23.34666

Section

AAAI Technical Track on Natural Language Processing II