TIME: Text and Image Mutual-Translation Adversarial Networks

Authors

  • Bingchen Liu Department of Computer Science, Rutgers University
  • Kunpeng Song Department of Computer Science, Rutgers University
  • Yizhe Zhu Department of Computer Science, Rutgers University
  • Gerard de Melo Department of Computer Science, Rutgers University
  • Ahmed Elgammal Department of Computer Science, Rutgers University

DOI:

https://doi.org/10.1609/aaai.v35i3.16305

Keywords:

Language and Vision, Adversarial Learning & Robustness, Language Models, (Deep) Neural Network Algorithms

Abstract

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce image–text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB dataset (Inception Score of 4.91 and Fréchet Inception Distance of 14.3), and shows promising performance on the MS-COCO dataset for image captioning and downstream vision-language tasks.
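
The abstract mentions an annealing conditional hinge loss but does not spell out its formula; the following is a minimal, hypothetical PyTorch sketch of the general idea: a hinge-based adversarial loss whose conditional (text-matching) term is re-weighted by a coefficient that is annealed over training. The function names and the linear schedule are illustrative assumptions, not the paper's exact formulation.

    # Hypothetical sketch (PyTorch) of an annealed conditional hinge loss.
    # Illustrates the idea from the abstract only; the exact loss used by
    # TIME is defined in the paper.
    import torch
    import torch.nn.functional as F

    def d_hinge_loss(real_logits, fake_logits, mismatch_logits, anneal):
        """Discriminator loss: hinge terms for real/fake images, plus an
        annealed conditional term penalizing real images paired with
        mismatched text."""
        loss_real = F.relu(1.0 - real_logits).mean()
        loss_fake = F.relu(1.0 + fake_logits).mean()
        loss_mismatch = F.relu(1.0 + mismatch_logits).mean()
        return loss_real + loss_fake + anneal * loss_mismatch

    def g_hinge_loss(fake_logits):
        """Generator loss: standard hinge form."""
        return -fake_logits.mean()

    def anneal_coefficient(step, total_steps, start=1.0, end=0.0):
        """Linearly anneal the conditional-term weight (assumed schedule)."""
        t = min(step / max(total_steps, 1), 1.0)
        return start + (end - start) * t

In this sketch, a larger anneal value emphasizes text–image matching and a smaller one shifts weight toward unconditional realism; how TIME actually balances these terms over training is specified in the paper itself.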

Published

2021-05-18

How to Cite

Liu, B., Song, K., Zhu, Y., de Melo, G., & Elgammal, A. (2021). TIME: Text and Image Mutual-Translation Adversarial Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3), 2082-2090. https://doi.org/10.1609/aaai.v35i3.16305

Issue

Vol. 35 No. 3 (2021)

Section

AAAI Technical Track on Computer Vision II