Audio-Thinker: Guiding Large Audio Language Model When and How to Think via Reinforcement Learning

Authors

  • Shu Wu, Tencent AI Lab, Beijing, China
  • Chenxing Li, Tencent AI Lab, Beijing, China
  • Wenfu Wang, Tencent AI Lab, Beijing, China
  • Hao Zhang, Tencent AI Lab, Seattle, USA
  • Hualei Wang, Tencent AI Lab, Beijing, China
  • Meng Yu, Tencent AI Lab, Seattle, USA
  • Dong Yu, Tencent AI Lab, Seattle, USA

DOI:

https://doi.org/10.1609/aaai.v40i40.40689

Abstract

Recent advances in large language models, multimodal large language models, and large audio language models (LALMs) have substantially improved reasoning capabilities through reinforcement learning with rule-based rewards. However, explicit reasoning has not yet yielded substantial benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs through improved adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategy to task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish valid from flawed reasoning paths during training. Experimental results demonstrate that Audio-Thinker models outperform existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
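The adaptive think accuracy reward described in the abstract can be sketched in a minimal form. The function name, weights, and the use of rollout pass rate as a difficulty proxy below are illustrative assumptions, not details taken from the paper:

```python
def adaptive_think_reward(answer_correct: bool,
                          used_thinking: bool,
                          pass_rate: float) -> float:
    """Toy sketch of an adaptive 'think' accuracy reward.

    pass_rate: fraction of sampled rollouts that answered correctly,
    used here as a proxy for task difficulty (high pass_rate = easy).
    Weights (0.5) and the shaping scheme are hypothetical.
    """
    base = 1.0 if answer_correct else 0.0
    if used_thinking:
        # Reward explicit reasoning more on hard items (low pass rate),
        # so the model learns when thinking is worth the cost.
        bonus = (1.0 - pass_rate) * 0.5
    else:
        # Reward skipping reasoning on easy items (high pass rate).
        bonus = pass_rate * 0.5
    # Shaping bonus only applies when the final answer is correct.
    return base + (bonus if answer_correct else 0.0)
```

Under this sketch, a correct answer with reasoning on a hard item (pass rate 0) scores 1.5, a correct direct answer on an easy item (pass rate 1) also scores 1.5, and any incorrect answer scores 0, illustrating how such a reward can encourage reasoning only where it pays off.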

Published

2026-03-14

How to Cite

Wu, S., Li, C., Wang, W., Zhang, H., Wang, H., Yu, M., & Yu, D. (2026). Audio-Thinker: Guiding Large Audio Language Model When and How to Think via Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33962-33970. https://doi.org/10.1609/aaai.v40i40.40689

Section

AAAI Technical Track on Natural Language Processing V