Adaptive Cross-Modal Embeddings for Image-Text Alignment

Authors

  • Jonatas Wehrmann Pontifícia Universidade Católica do Rio Grande do Sul
  • Camila Kolling Pontifícia Universidade Católica do Rio Grande do Sul
  • Rodrigo C Barros Pontifícia Universidade Católica do Rio Grande do Sul

DOI:

https://doi.org/10.1609/aaai.v34i07.6915

Abstract

In this paper we introduce ADAPT, an approach that adapts the embedding representation of an instance from modality a using an embedding vector of an instance from modality b. Such an adaptation is designed to filter and enhance important information across internal features, allowing for guided vector representations, an effect that resembles the working of attention modules while being far more computationally efficient. Experimental results on two large-scale Image-Text alignment datasets show that ADAPT models outperform all the baseline approaches by large margins. In particular, for Image Retrieval a single ADAPT model outperforms the state-of-the-art approach by a relative improvement of R@1 ≈ 24%, and for Image Annotation by R@1 ≈ 8%, on the Flickr30k dataset. On MS COCO it provides an improvement of R@1 ≈ 12% for Image Retrieval and ≈ 7% for Image Annotation. Code is available at https://github.com/jwehrmann/retrieval.pytorch.
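The adaptation described in the abstract can be pictured as a small modulation block: an embedding vector from one modality gates and shifts the internal features of the other before pooling, yielding the "guided" representations mentioned above. The PyTorch sketch below assumes a scale-and-shift (gating) formulation; the class name CrossModalAdapt, the tensor shapes, and the mean pooling are illustrative assumptions, not the authors' exact implementation or the API of the linked repository.

```python
import torch
import torch.nn as nn


class CrossModalAdapt(nn.Module):
    """Minimal sketch of a cross-modal adaptation block (illustrative only).

    An embedding vector from modality b (e.g., a sentence vector) predicts
    channel-wise scaling and shifting parameters that modulate the internal
    feature vectors of modality a (e.g., image-region features).
    """

    def __init__(self, dim):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)  # gamma: per-channel filter weights
        self.to_shift = nn.Linear(dim, dim)  # beta: per-channel offsets

    def forward(self, feats_a, emb_b):
        # feats_a: (batch_a, regions, dim) internal features from modality a
        # emb_b:   (batch_b, dim) embedding vectors from modality b
        gamma = torch.sigmoid(self.to_scale(emb_b))  # (batch_b, dim), filters channels
        beta = self.to_shift(emb_b)                  # (batch_b, dim), shifts channels
        # Broadcast so every instance of modality a is adapted by every
        # instance of modality b: result is (batch_a, batch_b, regions, dim).
        adapted = feats_a.unsqueeze(1) * gamma[None, :, None, :] + beta[None, :, None, :]
        # Pool regions and L2-normalise to obtain guided vector representations.
        pooled = adapted.mean(dim=2)
        return nn.functional.normalize(pooled, dim=-1)


# Usage sketch: cosine similarities between adapted image vectors and sentence vectors.
if __name__ == "__main__":
    adapt = CrossModalAdapt(dim=1024)
    image_feats = torch.randn(8, 36, 1024)     # 8 images, 36 region features each
    sentence_emb = torch.randn(5, 1024)        # 5 sentence embeddings
    guided = adapt(image_feats, sentence_emb)  # (8, 5, 1024)
    sims = (guided * nn.functional.normalize(sentence_emb, dim=-1)[None]).sum(-1)  # (8, 5)
```

Because the modulation is a pair of linear projections followed by element-wise operations, it avoids the pairwise attention maps of standard cross-attention, which is consistent with the abstract's claim of lower computational cost.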

Published

2020-04-03

How to Cite

Wehrmann, J., Kolling, C., & C Barros, R. (2020). Adaptive Cross-Modal Embeddings for Image-Text Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 12313-12320. https://doi.org/10.1609/aaai.v34i07.6915

Section

AAAI Technical Track: Vision