TIME: Text and Image Mutual-Translation Adversarial Networks

Bingchen Liu; Kunpeng Song; Yizhe Zhu; Gerard de Melo; Ahmed Elgammal

doi:10.1609/aaai.v35i3.16305

Authors

Bingchen Liu, Department of Computer Science, Rutgers University
Kunpeng Song, Department of Computer Science, Rutgers University
Yizhe Zhu, Department of Computer Science, Rutgers University
Gerard de Melo, Department of Computer Science, Rutgers University
Ahmed Elgammal, Department of Computer Science, Rutgers University

DOI:

https://doi.org/10.1609/aaai.v35i3.16305

Keywords:

Language and Vision, Adversarial Learning & Robustness, Language Models, (Deep) Neural Network Algorithms

Abstract

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB dataset (Inception Score of 4.91 and Fréchet Inception Distance of 14.3), and shows promising performance on the MS-COCO dataset for image captioning and downstream vision-language tasks.
