Audio-Thinker: Guiding Large Audio Language Model When and How to Think via Reinforcement Learning

Authors

  • Shu Wu, Tencent AI Lab, Beijing, China
  • Chenxing Li, Tencent AI Lab, Beijing, China
  • Wenfu Wang, Tencent AI Lab, Beijing, China
  • Hao Zhang, Tencent AI Lab, Seattle, USA
  • Hualei Wang, Tencent AI Lab, Beijing, China
  • Meng Yu, Tencent AI Lab, Seattle, USA
  • Dong Yu, Tencent AI Lab, Seattle, USA

DOI:

https://doi.org/10.1609/aaai.v40i40.40689

Abstract

Recent advances in large language models, multimodal large language models, and large audio language models (LALMs) have substantially improved reasoning capabilities through reinforcement learning with rule-based rewards. However, explicit reasoning has not yet yielded substantial benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs through improved adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategy to task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish valid from flawed reasoning paths during training. Experimental results demonstrate that Audio-Thinker models outperform existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
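The adaptive think accuracy reward described in the abstract can be sketched in a minimal form. The function name, weights, and the use of rollout pass rate as a difficulty proxy below are illustrative assumptions, not details taken from the paper:

```python
def adaptive_think_reward(answer_correct: bool,
                          used_thinking: bool,
                          pass_rate: float) -> float:
    """Toy sketch of an adaptive 'think' accuracy reward.

    pass_rate: fraction of sampled rollouts that answered correctly,
    used here as a proxy for task difficulty (high pass_rate = easy).
    Weights (0.5) and the shaping scheme are hypothetical.
    """
    base = 1.0 if answer_correct else 0.0
    if used_thinking:
        # Reward explicit reasoning more on hard items (low pass rate),
        # so the model learns when thinking is worth the cost.
        bonus = (1.0 - pass_rate) * 0.5
    else:
        # Reward skipping reasoning on easy items (high pass rate).
        bonus = pass_rate * 0.5
    # Shaping bonus only applies when the final answer is correct.
    return base + (bonus if answer_correct else 0.0)
```

Under this sketch, a correct answer with reasoning on a hard item (pass rate 0) scores 1.5, a correct direct answer on an easy item (pass rate 1) also scores 1.5, and any incorrect answer scores 0, illustrating how such a reward can encourage reasoning only where it pays off.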

Published

2026-03-14

How to Cite

Wu, S., Li, C., Wang, W., Zhang, H., Wang, H., Yu, M., & Yu, D. (2026). Audio-Thinker: Guiding Large Audio Language Model When and How to Think via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33962-33970. https://doi.org/10.1609/aaai.v40i40.40689

Section

AAAI Technical Track on Natural Language Processing V