CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval

Hai X. Pham; Ricardo Guerrero; Vladimir Pavlovic; Jiatong Li

doi:10.1609/aaai.v35i3.16343

Authors

Hai X. Pham, Samsung AI Center Cambridge
Ricardo Guerrero, Samsung AI Center Cambridge
Vladimir Pavlovic, Samsung AI Center Cambridge; Department of Computer Science, Rutgers University
Jiatong Li, Department of Computer Science, Rutgers University

DOI:

https://doi.org/10.1609/aaai.v35i3.16343

Keywords:

Language and Vision

Abstract

Despite the abundance of multi-modal data, such as image-text pairs, there has been little effort in understanding the individual entities and their different roles in the construction of these data instances. In this work, we endeavour to discover the entities and their corresponding importance in cooking recipes automatically as a visual-linguistic association problem. More specifically, we introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks. This model allows one to discover complex functional and hierarchical relationships between images and text, and among textual parts of a recipe including title, ingredients and cooking instructions. Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are not only able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision, but we can also learn more meaningful feature representations of food recipes, appropriate for challenging cross-modal retrieval and recipe adaption tasks.
