Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration

Ziyang Ma; Guanrou Yang; Yifan Yang; Zhifu Gao; Jiaming Wang; Zhihao Du; Fan Yu; Qian Chen; Siqi Zheng; Shiliang Zhang; Xie Chen

doi:10.1609/aaai.v39i23.34666

Authors

Ziyang Ma MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Guanrou Yang MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Yifan Yang MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Zhifu Gao Alibaba Group
Jiaming Wang Alibaba Group
Zhihao Du Alibaba Group
Fan Yu Alibaba Group
Qian Chen Alibaba Group
Siqi Zheng Alibaba Group
Shiliang Zhang Alibaba Group
Xie Chen MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University

Abstract

In this paper, we focus on solving one of the most important tasks in speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Despite the growing body of research in this area, we find that many crucial design decisions in LLM-based ASR systems are inadequately justified. This lack of clarity impedes the field's progress, making it difficult to pinpoint which design choices truly improve model performance. To address this, we conduct a comprehensive series of experiments that explore various design aspects, leading to the optimal LLM-based ASR system. We find that delicate designs are unnecessary: a clean setup with little task-specific design is already competent. The resulting models achieve strong performance on the LibriSpeech and GigaSpeech datasets compared to both LLM-based and non-LLM-based models. Finally, we explore how ASR capability emerges in LLM-based models during modality alignment. We hope that our study facilitates research on extending LLMs with cross-modal capabilities and offers useful insights to the LLM-based ASR community.
