Sequence-to-Sequence Learning via Shared Latent Representation

Xu Shen; Xinmei Tian; Jun Xing; Yong Rui; Dacheng Tao

doi:10.1609/aaai.v32i1.11837

Authors

Xu Shen, University of Science and Technology of China
Xinmei Tian, University of Science and Technology of China
Jun Xing, University of Southern California
Yong Rui, Lenovo Research
Dacheng Tao, University of Sydney

Abstract

Sequence-to-sequence learning is a popular research area in deep learning, with applications such as video captioning and speech recognition. Existing methods model this learning as a mapping process: the input sequence is first encoded to a fixed-size vector, and the target sequence is then decoded from that vector. Although simple and intuitive, such a mapping model is task-specific and cannot be directly used for different tasks. In this paper, we propose a star-like framework for general and flexible sequence-to-sequence learning, in which different types of media content (the peripheral nodes) can be encoded to and decoded from a shared latent representation (SLR) (the central node). This is inspired by the fact that the human brain can learn and express an abstract concept in different ways. The media-invariant property of the SLR can be seen as a high-level regularization on the intermediate vector, forcing it to capture not only the latent representation within each individual medium, as auto-encoders do, but also the transitions between media, as mapping models do. Moreover, the SLR model is content-specific: it needs to be trained only once for a dataset and can then be used for different tasks. We show how to train an SLR model via dropout and use it for different sequence-to-sequence tasks. Our SLR model is validated on the Youtube2Text and MSR-VTT datasets, achieving superior performance on the video-to-sentence task and the first sentence-to-video results.
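
The star-like layout described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything here is an assumption made for illustration. Fixed-length vectors and random linear maps stand in for sequences and learned sequence encoders/decoders, the names `StarModel`, `drop_media`, `rand_matrix`, and `matvec` are hypothetical, averaging the per-medium codes is an arbitrary choice, and "training via dropout" is read as randomly hiding whole media during training, which is only one possible interpretation. No training loop is shown.

```python
import random


def rand_matrix(rng, rows, cols):
    """Random matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class StarModel:
    """Toy star layout: each medium (peripheral node) has an encoder into,
    and a decoder out of, one shared latent space (central node)."""

    def __init__(self, latent_dim, media_dims, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.enc = {m: rand_matrix(rng, latent_dim, d) for m, d in media_dims.items()}
        self.dec = {m: rand_matrix(rng, d, latent_dim) for m, d in media_dims.items()}

    def encode(self, inputs):
        """Map every available medium to the latent space and average the codes.
        Averaging is an illustrative choice, not taken from the paper."""
        codes = [matvec(self.enc[m], x) for m, x in inputs.items()]
        return [sum(c[i] for c in codes) / len(codes) for i in range(self.latent_dim)]

    def decode(self, latent, medium):
        """Decode the shared latent vector into the requested medium."""
        return matvec(self.dec[medium], latent)

    def translate(self, x, source, target):
        """Any source medium to any target medium via the shared latent vector."""
        return self.decode(self.encode({source: x}), target)


def drop_media(inputs, rng, p=0.5):
    """Hide each medium with probability p, always keeping at least one.
    One hypothetical reading of 'training via dropout'."""
    kept = {m: x for m, x in inputs.items() if rng.random() >= p}
    if not kept:
        m = rng.choice(sorted(inputs))
        kept = {m: inputs[m]}
    return kept


model = StarModel(latent_dim=4, media_dims={"video": 6, "sentence": 3})
video = [0.1] * 6
sentence = model.translate(video, "video", "sentence")  # video-to-sentence
video_back = model.translate(sentence, "sentence", "video")  # sentence-to-video
```

Because every medium shares the central latent space, adding a task in this sketch means choosing a different (source, target) pair rather than building a new mapping model, which mirrors the "trained once, used for different tasks" claim.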