Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems
Keywords: Intention prominence, User intention understanding, Long Short-Term Memory (LSTM), Multi-task
Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses to users' speech utterances, and the most critical problem is analyzing user intention. Research shows that user intention conveyed through speech is expressed not only by content but is also closely related to users' speaking manners (e.g., with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus conveyed by text and emphasis conveyed by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long-range contextual dependencies, to detect focus and emphasis, and combine the focus and emphasis detection tasks through multi-task learning (MTL) so that each reinforces the other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a data set of 135,566 utterances collected from the real-world Sogou Voice Assistant show that our method outperforms the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, deployment in the Sogou Voice Assistant indicates that our method can improve performance on user intention understanding by 7%.
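To make the multi-task setup concrete, the following is a minimal sketch (not the paper's implementation) of the core idea: a single LSTM produces a shared representation of an utterance's feature sequence, and two task-specific heads, one for focus and one for emphasis, are trained on it with a summed loss. All dimensions, weights, and labels here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative dimensions: input feature size D, hidden size H, sequence length T.
D, H, T = 8, 16, 5

# Shared LSTM parameters: one stacked weight matrix for the four gates (i, f, o, g).
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)

# Task-specific heads: one linear binary classifier per task on the final hidden state.
W_focus = rng.normal(scale=0.1, size=H)
W_emph = rng.normal(scale=0.1, size=H)

def lstm_forward(x_seq):
    """Run the shared LSTM over a (T, D) feature sequence; return the final hidden state."""
    h = np.zeros(H)
    c = np.zeros(H)
    for x in x_seq:
        z = W @ np.concatenate([x, h]) + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g          # cell state update
        h = o * np.tanh(c)         # hidden state (shared representation)
    return h

# A toy utterance: T frames of D-dimensional features (stand-ins for text/acoustic inputs).
x_seq = rng.normal(size=(T, D))
h = lstm_forward(x_seq)

# Both heads read the same h; in MTL their losses are summed so gradients
# through the shared LSTM serve both tasks, letting each reinforce the other.
p_focus = sigmoid(W_focus @ h)
p_emph = sigmoid(W_emph @ h)

y_focus, y_emph = 1.0, 0.0  # toy labels for the two tasks
loss = -(y_focus * np.log(p_focus) + (1 - y_focus) * np.log(1 - p_focus)) \
       - (y_emph * np.log(p_emph) + (1 - y_emph) * np.log(1 - p_emph))
print(float(p_focus), float(p_emph), float(loss))
```

In a full system the heads would emit per-token predictions and the joint loss would be minimized by backpropagation; the sketch only shows how the shared representation couples the two detection tasks.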