Reject Decoding via Language-Vision Models for Text-to-Image Synthesis
DOI:
https://doi.org/10.1609/aaai.v37i3.25379
Keywords:
CV: Language and Vision, CV: Computational Photography, Image & Video Synthesis
Abstract
Transformer-based text-to-image synthesis generates images from abstract textual conditions and achieves promising results. Because transformer-based models predict visual tokens step by step at test time, early errors are hard to correct and propagate to later steps. To alleviate this issue, the common practice is to draw multiple paths from the transformer-based model and re-rank the images decoded from those paths, keeping the best one and filtering out the others. This procedure is inefficient, since the computation spent on the discarded images is wasted. To improve the effectiveness and efficiency of decoding, we exploit a reject decoding algorithm with tiny multi-modal models to enlarge the search space and exclude useless paths as early as possible. Specifically, we build tiny multi-modal models that evaluate the similarity between partial paths and the caption at multiple scales. We then propose a reject decoding algorithm that excludes the lowest-quality partial paths at inner steps. Thus, under the same computing load as the original decoding, we can search over more paths, improving decoding efficiency and synthesis quality. Experiments conducted on the MS-COCO dataset and large-scale datasets show that the proposed reject decoding algorithm excludes useless paths and enlarges the search space, improving synthesis quality while consuming less time.
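The decoding scheme the abstract describes — expanding multiple token paths, scoring partial paths at inner steps with a small language-vision model, and rejecting the worst before spending further compute on them — can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `expand_path` and `score_partial` are hypothetical stand-ins for the transformer's next-token sampling and the tiny multi-modal similarity scorer, and all parameter names are assumptions.

```python
import random

def expand_path(path):
    # Stand-in for sampling the next visual token from the transformer
    # (hypothetical; a real model returns a token id from its codebook).
    return path + [random.randint(0, 255)]

def score_partial(path, caption):
    # Stand-in for the tiny multi-modal model's caption-path similarity
    # (hypothetical; a real scorer compares decoded pixels to the text).
    return random.random()

def reject_decode(caption, n_paths=16, seq_len=12, check_every=4, keep_ratio=0.5):
    """Grow n_paths token sequences; at inner steps, reject the
    lowest-scoring partial paths and reallocate the budget to survivors."""
    paths = [[] for _ in range(n_paths)]
    for step in range(1, seq_len + 1):
        paths = [expand_path(p) for p in paths]
        # Inner-step rejection: keep only the best-scoring partial paths,
        # so later computation is not wasted on paths that will be discarded.
        if step % check_every == 0 and step < seq_len:
            ranked = sorted(paths, key=lambda p: score_partial(p, caption),
                            reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            survivors = ranked[:keep]
            # Refill the path budget by branching from the survivors,
            # which enlarges the search space at no extra total cost.
            paths = [list(survivors[i % keep]) for i in range(n_paths)]
    # Final re-ranking over complete paths, as in the standard pipeline.
    return max(paths, key=lambda p: score_partial(p, caption))
```

The key design point is that rejection happens *inside* the decoding loop rather than only after full images are generated, so the per-step budget freed by rejected paths can be redirected into additional branches from the surviving ones.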
Published
2023-06-26
How to Cite
Wu, F., Liu, L., Hao, F., He, F., Wang, L., & Cheng, J. (2023). Reject Decoding via Language-Vision Models for Text-to-Image Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 2785-2794. https://doi.org/10.1609/aaai.v37i3.25379
Issue
Section
AAAI Technical Track on Computer Vision III