SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
DOI
https://doi.org/10.1609/aaai.v40i36.40338
Abstract
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
Published
2026-03-14
How to Cite
Ge, Y., Zhang, J., Liu, X., Li, B., Ma, X., Wang, C., Ye, K., Du, Y., Zhang, L., Huang, Y., Xiao, T., Yu, Z., & Zhu, J. (2026). SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30807-30815. https://doi.org/10.1609/aaai.v40i36.40338
Section
AAAI Technical Track on Natural Language Processing I
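The abstract reports agreement rates between the judge model and human evaluators. A minimal sketch of how such a pairwise agreement rate is typically computed — the function name and toy data below are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: agreement rate as the fraction of evaluation items
# where the judge model's preferred response matches the human preference.

def agreement_rate(judge_prefs, human_prefs):
    """Fraction of items where judge and human pick the same winner."""
    assert len(judge_prefs) == len(human_prefs) and judge_prefs
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)

# Toy example: "A"/"B" marks which of two candidate responses was preferred.
judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "B", "A", "B"]
print(f"{agreement_rate(judge, human):.2%}")  # -> 80.00%
```

In practice, such evaluations often also report chance-corrected measures (e.g. Cohen's kappa) alongside raw agreement; the abstract specifies only the raw agreement rate.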