AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Rongjie Huang; Mingze Li; Dongchao Yang; Jiatong Shi; Xuankai Chang; Zhenhui Ye; Yuning Wu; Zhiqing Hong; Jiawei Huang; Jinglin Liu; Yi Ren; Yuexian Zou; Zhou Zhao; Shinji Watanabe

doi:10.1609/aaai.v38i21.30570

Authors

Rongjie Huang Zhejiang University
Mingze Li Zhejiang University
Dongchao Yang Peking University
Jiatong Shi Carnegie Mellon University
Xuankai Chang Carnegie Mellon University
Zhenhui Ye Zhejiang University
Yuning Wu Renmin University of China
Zhiqing Hong Zhejiang University
Jiawei Huang Zhejiang University
Jinglin Liu Zhejiang University
Yi Ren Zhejiang University
Yuexian Zou Peking University
Zhou Zhao Zhejiang University
Shinji Watanabe Carnegie Mellon University

Keywords:

Artificial Intelligence, Natural language processing and speech recognition, Human-AI interaction (including Human-robot interaction)

Abstract

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite this recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) an input/output interface (ASR, TTS) to support spoken dialogue. Given the increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving 16 AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Code can be found at https://github.com/AIGC-Audio/AudioGPT
