Imagined Visual Representations as Multimodal Embeddings
DOI: https://doi.org/10.1609/aaai.v31i1.11155

Keywords: multimodal representations, representation learning, semantic similarity, semantic relatedness, visual similarity

Abstract
Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently re-constructive and associative nature of human memory. Using seven benchmark concept similarity tests, we show that the mapped (or imagined) vectors not only help to fuse multimodal information but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more human-like judgments. Ultimately, the present work sheds light on fundamental questions of natural language understanding concerning the fusion of vision and language, such as the plausibility of more associative and re-constructive approaches.
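The abstract outlines the pipeline only at a high level: fit a language-to-vision mapping on the concepts that have visual features, predict ("imagine") visual vectors for every word, and fuse the predicted visual vectors with the original language embeddings. The sketch below illustrates that idea under stated assumptions; the random placeholder data, the ridge-regression mapper, the embedding dimensions, and the L2-normalize-and-concatenate fusion are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of "imagined" multimodal embeddings (illustrative, not the
# authors' exact setup): learn a language-to-vision mapping on grounded
# concepts, imagine visual vectors for all words, and concatenate modalities.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

n_words, d_text, d_visual = 1000, 300, 128
text_emb = rng.normal(size=(n_words, d_text))     # placeholder word embeddings
has_image = rng.random(n_words) < 0.3             # subset with visual features
visual_feat = rng.normal(size=(has_image.sum(), d_visual))  # placeholder CNN features

# 1) Fit the language-to-vision mapping on the visually grounded subset.
mapper = Ridge(alpha=1.0)
mapper.fit(text_emb[has_image], visual_feat)

# 2) "Imagine" a visual vector for every word, including ungrounded ones.
imagined = mapper.predict(text_emb)

# 3) Fuse modalities: L2-normalize each part, then concatenate.
def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

multimodal = np.hstack([l2norm(text_emb), l2norm(imagined)])

# Concept similarity is then the cosine between fused vectors.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(multimodal[0], multimodal[1]))
```

In an actual evaluation, the cosine scores over word pairs from benchmarks such as the seven concept similarity tests mentioned above would be correlated with human ratings; the placeholder data here only demonstrate the shape of the computation.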