Multi-Speaker Video Dialog with Frame-Level Temporal Localization

Qiang Wang; Pin Jiang; Zhiyi Guo; Yahong Han; Zhou Zhao

doi:10.1609/aaai.v34i07.6901

Authors

Qiang Wang, Tianjin University
Pin Jiang, Tianjin University
Zhiyi Guo, Tianjin University
Yahong Han, Tianjin University
Zhou Zhao, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v34i07.6901

Abstract

To simulate real-life human interaction, dialog systems generate a response to previous chat utterances. Several studies have addressed two-speaker video dialog in the form of question answering. However, more informative semantic cues might be exploited when multiple speakers chat about or discuss a video over multiple rounds, so multi-speaker video dialog is more applicable in real life. Moreover, speakers usually chat about one sub-segment of a long video fragment for a period of time. Current video dialog systems must be given the relevant video sub-segment directly, yet accurately spotting that sub-segment is hard in practical applications. In this paper, we introduce a novel task, Multi-Speaker Video Dialog with frame-level Temporal Localization (MSVD-TL), to make video dialog systems more applicable. Given a long video fragment and a set of chat history utterances, MSVD-TL aims to simultaneously predict the following response and localize the relevant video sub-segment at the frame level. We develop a new multi-task model with a response prediction module and a frame-level temporal localization module. In addition, we focus on the characteristics of the video dialog generation process and exploit the relations among the video fragment, the chat history, and the following response to refine their representations. We evaluate our approach on both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results demonstrate that MSVD-TL enhances the applicability of video dialog in real life.
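
The inputs and outputs of the MSVD-TL task described in the abstract can be pictured with a small sketch. This is an illustration only and not the authors' model: the class names, the per-frame relevance scores, the 0.5 threshold, and the "longest contiguous run" rule are all assumptions made for the example. It shows one simple way per-frame scores could be turned into a frame-level sub-segment that is returned together with a response.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    speaker: str  # hypothetical speaker identifier
    text: str


@dataclass
class DialogSample:
    num_frames: int           # length of the long video fragment, in frames
    history: List[Utterance]  # chat history from multiple speakers


@dataclass
class Prediction:
    response: str                        # predicted following response
    segment: Optional[Tuple[int, int]]   # (start_frame, end_frame), inclusive


def longest_relevant_run(frame_scores: List[float],
                         threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return the longest contiguous run of frames scoring >= threshold.

    Assumed decoding rule for illustration; returns None if no frame qualifies.
    """
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, score in enumerate(frame_scores + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            run = (start, i - 1)
            if best is None or run[1] - run[0] > best[1] - best[0]:
                best = run
            start = None
    return best


# Usage with made-up scores and a placeholder response string.
sample = DialogSample(
    num_frames=5,
    history=[Utterance("A", "Did you see that?"), Utterance("B", "Which part?")],
)
scores = [0.1, 0.9, 0.8, 0.2, 0.7]  # one assumed relevance score per frame
prediction = Prediction(response="<predicted response>",
                        segment=longest_relevant_run(scores))
```

The point of the sketch is the output shape: a single prediction carries both the response and the frame indices of the sub-segment, which is what distinguishes MSVD-TL from the MSVD w/o TL setting, where the relevant sub-segment is supplied as input.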
