Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Ran Xu; Caiming Xiong; Wei Chen; Jason Corso

doi:10.1609/aaai.v29i1.9512

Authors

Ran Xu, State University of New York at Buffalo
Caiming Xiong, University of California, Los Angeles
Wei Chen, State University of New York at Buffalo
Jason Corso, University of Michigan

DOI:

https://doi.org/10.1609/aaai.v29i1.9512

Keywords:

Natural Language Generation, Deep Learning, Video Content Analysis, Video to Text, Multi-Modality

Abstract

Joint video-language modeling has been attracting increasing attention. However, most existing approaches focus on exploring the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model, and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds a sentence into a continuous vector space, preserving visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, our system is able to accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. In the experiments, the results show that our approach outperforms SVM, CRF, and CCA baselines in predicting Subject-Verb-Object triplets and in natural sentence generation, and is better than CCA in video retrieval and language retrieval tasks.
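
The joint embedding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projections `Wv` and `Wt`, the toy feature vectors, the learning rate, and the function names are all assumptions made for illustration, standing in for the deep video model and the dependency-tree language model. The sketch only shows the mechanics of minimizing the squared distance between paired embeddings while updating both sides, followed by nearest-neighbor retrieval in the joint space. Minimizing distance alone admits a degenerate all-zero solution, so this toy should not be read as a complete training objective.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def joint_loss(Wv, Wt, pairs):
    """Mean squared distance between projected video and text features."""
    return sum(sq_dist(matvec(Wv, v), matvec(Wt, t)) for v, t in pairs) / len(pairs)


def train_epoch(Wv, Wt, pairs, lr=0.05):
    """One pass of per-pair gradient descent on ||Wv v - Wt t||^2.

    Both projections are updated, mirroring the joint update of the two models.
    """
    for v, t in pairs:
        d = [a - b for a, b in zip(matvec(Wv, v), matvec(Wt, t))]
        for i, di in enumerate(d):
            for j in range(len(v)):
                Wv[i][j] -= lr * 2 * di * v[j]   # gradient w.r.t. Wv is 2 d v^T
            for j in range(len(t)):
                Wt[i][j] += lr * 2 * di * t[j]   # gradient w.r.t. Wt is -2 d t^T


def retrieve(query, candidates):
    """Index of the candidate embedding closest to the query embedding."""
    return min(range(len(candidates)), key=lambda k: sq_dist(query, candidates[k]))


def demo():
    """Train on three hypothetical (video feature, text feature) pairs."""
    random.seed(0)
    pairs = [
        ([1.0, 0.0, 0.0], [1.0, 0.0]),
        ([0.0, 1.0, 0.0], [0.0, 1.0]),
        ([0.0, 0.0, 1.0], [1.0, 1.0]),
    ]
    joint_dim = 2
    Wv = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(joint_dim)]
    Wt = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(joint_dim)]
    before = joint_loss(Wv, Wt, pairs)
    for _ in range(200):
        train_epoch(Wv, Wt, pairs)
    after = joint_loss(Wv, Wt, pairs)
    return before, after
```

Under these assumptions, `demo()` returns the loss before and after training, and `retrieve` can serve either retrieval direction: pass a projected sentence as the query against projected videos for video retrieval, or the reverse for language retrieval.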
