Reject Decoding via Language-Vision Models for Text-to-Image Synthesis

Authors

  • Fuxiang Wu Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
  • Liu Liu The University of Sydney
  • Fusheng Hao Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
  • Fengxiang He JD.com
  • Lei Wang Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
  • Jun Cheng Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v37i3.25379

Keywords:

CV: Language and Vision, CV: Computational Photography, Image & Video Synthesis

Abstract

Transformer-based text-to-image synthesis generates images from abstract textual descriptions and achieves promising results. Because transformer-based models predict visual tokens step by step at test time, an early error is hard to correct and propagates to later steps. To alleviate this issue, the common practice is to sample multiple decoding paths from the transformer-based model and re-rank the images decoded from these paths to keep the best one and discard the rest. However, the computation spent on decoding the discarded images is wasted, making this procedure inefficient. To improve the effectiveness and efficiency of decoding, we exploit a reject decoding algorithm with tiny multi-modal models to enlarge the search space and exclude useless paths as early as possible. Specifically, we build tiny multi-modal models that evaluate the similarity between partial paths and the caption at multiple scales. We then propose a reject decoding algorithm that excludes the lowest-quality partial paths at intermediate steps. Thus, under the same computational load as the original decoding, we can search over more paths, improving decoding efficiency and synthesis quality. Experiments conducted on the MS-COCO dataset and large-scale datasets show that the proposed reject decoding algorithm excludes useless paths and enlarges the set of searched paths, improving synthesis quality while consuming less time.
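The sketch below illustrates the general idea of the decoding loop described in the abstract: decode several paths in parallel, score the partial paths against the caption at a few inner steps, and reject the weakest ones early so the remaining budget is spent on promising paths. It is a minimal illustration, not the authors' implementation; the interfaces `generator.init_paths`, `generator.step`, and the per-scale `scorers` are hypothetical placeholders standing in for the transformer generator and the tiny multi-modal evaluators.

```python
import torch


def reject_decoding(generator, scorers, caption, num_paths=16, keep_ratio=0.5,
                    check_steps=(64, 128, 192), total_steps=256):
    """Multi-path decoding with early rejection of weak partial paths (sketch).

    Assumed (hypothetical) interfaces:
      generator.init_paths(caption, n) -> list of n empty decoding paths
      generator.step(paths, caption)   -> paths, each extended by one visual token
      scorers[s](paths, caption)       -> caption-image similarity per partial path,
                                          where s indexes the evaluation scale
    """
    paths = generator.init_paths(caption, num_paths)
    for t in range(total_steps):
        # Predict the next visual token for every surviving path.
        paths = generator.step(paths, caption)

        if t in check_steps:
            # Evaluate partial paths at this scale and keep only the top fraction.
            scale = check_steps.index(t)
            scores = torch.as_tensor(scorers[scale](paths, caption))
            k = max(1, int(len(paths) * keep_ratio))
            keep = torch.topk(scores, k).indices.tolist()
            paths = [paths[i] for i in keep]

    # Final re-ranking of the surviving complete paths with the last scorer.
    final_scores = torch.as_tensor(scorers[-1](paths, caption))
    return paths[int(final_scores.argmax())]
```

Because low-quality partial paths are discarded before being fully decoded, the same compute budget can cover a larger initial `num_paths` than plain decode-then-re-rank.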

Published

2023-06-26

How to Cite

Wu, F., Liu, L., Hao, F., He, F., Wang, L., & Cheng, J. (2023). Reject Decoding via Language-Vision Models for Text-to-Image Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 2785-2794. https://doi.org/10.1609/aaai.v37i3.25379

Issue

Section

AAAI Technical Track on Computer Vision III