Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems


  • Yishuang Ning Tsinghua University
  • Jia Jia Tsinghua University
  • Zhiyong Wu Tsinghua University
  • Runnan Li Tsinghua University
  • Yongsheng An Tsinghua University
  • Yanfeng Wang Beijing Sougou Science and Technology Development Co., Ltd.
  • Helen Meng The Chinese University of Hong Kong


Intention prominence, User intention understanding, Long Short-Term Memory (LSTM), Multi-task


Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses to users' speech utterances, and the most critical problem in doing so is analyzing user intention. Research shows that user intention conveyed through speech is expressed not only by content but is also closely related to users' speaking manner (e.g., with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus in text and emphasis in speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long- and short-term contextual dependencies, to detect focus and emphasis, and combine the focus and emphasis detection tasks through multi-task learning (MTL) so that each task reinforces the performance of the other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a data set of 135,566 utterances collected from the real-world Sogou Voice Assistant show that our method outperforms the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, a deployment in the Sogou Voice Assistant indicates that our method can improve the performance of user intention understanding by 7%.
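
The multi-task idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes hard parameter sharing (one LSTM encoder feeding two task-specific output heads), a single per-token feature sequence for both tasks, bias-free gates, and a weighted sum of binary cross-entropy losses. The names `SharedLSTM`, `head`, `multitask_loss`, and the weight `alpha` are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedLSTM:
    """Single-layer LSTM encoder shared by both tasks (assumed design, no biases)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate (input, forget, output, candidate),
        # each applied to the concatenation [x_t ; h_{t-1}].
        self.W = {g: rand_matrix(n_hid, n_in + n_hid, rng) for g in "ifoc"}

    def forward(self, xs):
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        states = []
        for x in xs:
            z = x + h  # list concatenation: [x_t ; h_{t-1}]
            i = [sigmoid(a) for a in matvec(self.W["i"], z)]
            f = [sigmoid(a) for a in matvec(self.W["f"], z)]
            o = [sigmoid(a) for a in matvec(self.W["o"], z)]
            g = [math.tanh(a) for a in matvec(self.W["c"], z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            states.append(h)
        return states


def head(w, states):
    """Task-specific logistic output: one probability per time step."""
    return [sigmoid(sum(wj * hj for wj, hj in zip(w, h))) for h in states]


def bce(probs, labels):
    """Mean binary cross-entropy over the sequence."""
    eps = 1e-12
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


def multitask_loss(encoder, w_focus, w_emph, xs, y_focus, y_emph, alpha=0.5):
    """Joint objective: both heads read the same shared hidden states."""
    states = encoder.forward(xs)
    loss_focus = bce(head(w_focus, states), y_focus)
    loss_emph = bce(head(w_emph, states), y_emph)
    return alpha * loss_focus + (1 - alpha) * loss_emph


if __name__ == "__main__":
    rng = random.Random(0)
    enc = SharedLSTM(n_in=3, n_hid=4, rng=rng)
    xs = [[rng.random() for _ in range(3)] for _ in range(5)]  # toy features
    w_f = [rng.uniform(-0.1, 0.1) for _ in range(4)]
    w_e = [rng.uniform(-0.1, 0.1) for _ in range(4)]
    print(multitask_loss(enc, w_f, w_e, xs, [1, 0, 0, 1, 0], [0, 0, 1, 1, 0]))
```

Because both losses depend on the same encoder weights, gradients from focus detection and emphasis detection would update a common representation during training, which is one common way that one task can reinforce the other. The sketch shows only the forward pass and the joint loss; training and the subsequent Bayesian network stage are omitted.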




How to Cite

Ning, Y., Jia, J., Wu, Z., Li, R., An, Y., Wang, Y., & Meng, H. (2017). Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). Retrieved from