DIFFA: Large Language Diffusion Models Can Listen and Understand

Authors

  • Jiaming Zhou — College of Computer Science, Nankai University; Institute of Artificial Intelligence (TeleAI), China Telecom, China
  • Hongjie Chen — Institute of Artificial Intelligence (TeleAI), China Telecom, China
  • Shiwan Zhao — College of Computer Science, Nankai University
  • Jian Kang — Institute of Artificial Intelligence (TeleAI), China Telecom, China
  • Jie Li — Institute of Artificial Intelligence (TeleAI), China Telecom, China
  • Enzhi Wang — College of Computer Science, Nankai University
  • Yujie Guo — College of Computer Science, Nankai University
  • Haoqin Sun — College of Computer Science, Nankai University
  • Hui Wang — College of Computer Science, Nankai University
  • Aobo Kong — College of Computer Science, Nankai University
  • Yong Qin — College of Computer Science, Nankai University
  • Xuelong Li — Institute of Artificial Intelligence (TeleAI), China Telecom, China

DOI:

https://doi.org/10.1609/aaai.v40i41.40817

Abstract

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, large language diffusion models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of large language diffusion models for efficient and scalable audio understanding, opening a new direction for speech-driven AI.

Published

2026-03-14

How to Cite

Zhou, J., Chen, H., Zhao, S., Kang, J., Li, J., Wang, E., Guo, Y., Sun, H., Wang, H., Kong, A., Qin, Y., & Li, X. (2026). DIFFA: Large Language Diffusion Models Can Listen and Understand. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35112-35120. https://doi.org/10.1609/aaai.v40i41.40817

Section

AAAI Technical Track on Natural Language Processing VI