Reject Decoding via Language-Vision Models for Text-to-Image Synthesis
DOI:
https://doi.org/10.1609/aaai.v37i3.25379
Keywords:
CV: Language and Vision, CV: Computational Photography, Image & Video Synthesis
Abstract
Transformer-based text-to-image synthesis generates images from abstract textual conditions and achieves promising results. Because transformer-based models predict visual tokens step by step at test time, early errors are hard to correct and propagate to later steps. To alleviate this issue, the common practice is to draw multiple paths from the transformer-based model and re-rank the images decoded from those paths, keeping the best one and filtering out the others. This procedure is inefficient, since the computation spent on the discarded images is wasted. To improve the effectiveness and efficiency of decoding, we exploit a reject decoding algorithm with tiny multi-modal models to enlarge the search space and exclude useless paths as early as possible. Specifically, we build tiny multi-modal models that evaluate the similarity between partial paths and the caption at multiple scales. We then propose a reject decoding algorithm that excludes the lowest-quality partial paths at inner steps. Thus, under the same computing load as the original decoding, we can search over more paths, improving decoding efficiency and synthesis quality. Experiments conducted on the MS-COCO dataset and large-scale datasets show that the proposed reject decoding algorithm excludes useless paths and enlarges the search space, improving synthesis quality while consuming less time.
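The decoding scheme the abstract describes — expanding multiple token paths, scoring partial paths at inner steps with a small language-vision model, and rejecting the worst before spending further compute on them — can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `expand_path` and `score_partial` are hypothetical stand-ins for the transformer's next-token sampling and the tiny multi-modal similarity scorer, and all parameter names are assumptions.

```python
import random

def expand_path(path):
    # Stand-in for sampling the next visual token from the transformer
    # (hypothetical; a real model returns a token id from its codebook).
    return path + [random.randint(0, 255)]

def score_partial(path, caption):
    # Stand-in for the tiny multi-modal model's caption-path similarity
    # (hypothetical; a real scorer compares decoded pixels to the text).
    return random.random()

def reject_decode(caption, n_paths=16, seq_len=12, check_every=4, keep_ratio=0.5):
    """Grow n_paths token sequences; at inner steps, reject the
    lowest-scoring partial paths and reallocate the budget to survivors."""
    paths = [[] for _ in range(n_paths)]
    for step in range(1, seq_len + 1):
        paths = [expand_path(p) for p in paths]
        # Inner-step rejection: keep only the best-scoring partial paths,
        # so later computation is not wasted on paths that will be discarded.
        if step % check_every == 0 and step < seq_len:
            ranked = sorted(paths, key=lambda p: score_partial(p, caption),
                            reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            survivors = ranked[:keep]
            # Refill the path budget by branching from the survivors,
            # which enlarges the search space at no extra total cost.
            paths = [list(survivors[i % keep]) for i in range(n_paths)]
    # Final re-ranking over complete paths, as in the standard pipeline.
    return max(paths, key=lambda p: score_partial(p, caption))
```

The key design point is that rejection happens *inside* the decoding loop rather than only after full images are generated, so the per-step budget freed by rejected paths can be redirected into additional branches from the surviving ones.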
Published
2023-06-26
How to Cite
Wu, F., Liu, L., Hao, F., He, F., Wang, L., & Cheng, J. (2023). Reject Decoding via Language-Vision Models for Text-to-Image Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 2785-2794. https://doi.org/10.1609/aaai.v37i3.25379
Issue
Section
AAAI Technical Track on Computer Vision III