SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

Authors

  • Paul Grimal, CEA
  • Michael Soumm, Télécom Paris
  • Hervé Le Borgne, CEA
  • Olivier Ferret, CEA
  • Akihiro Sugimoto, National Institute of Informatics (NII)

DOI

https://doi.org/10.1609/aaai.v40i6.42426

Abstract

State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities, such as bounding boxes, for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods.
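For background on the "signal component" mentioned above: in standard diffusion models, a noisy sample satisfies x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, so a noise prediction ε̂ implies a clean-signal estimate x̂₀ = (x_t − √(1−ᾱ_t)·ε̂)/√(ᾱ_t). The sketch below illustrates only this generic identity, not the paper's actual method; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def predicted_signal(x_t, eps_hat, alpha_bar_t):
    """Recover the implicit clean-signal estimate x0_hat from a noise
    prediction, by inverting x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

# Toy round-trip check on a 1-D "image":
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)        # clean signal
eps = rng.standard_normal(4)       # Gaussian noise
alpha_bar = 0.7                    # noise-schedule value at some step t
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# With a perfect noise prediction, the signal is recovered exactly.
assert np.allclose(predicted_signal(x_t, eps, alpha_bar), x0)
```

During sampling, ε̂ comes from the denoising network, so x̂₀ is only an estimate; making this estimate an explicit object of control is the general idea the abstract refers to.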

Published

2026-03-14

How to Cite

Grimal, P., Soumm, M., Le Borgne, H., Ferret, O., & Sugimoto, A. (2026). SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4293–4301. https://doi.org/10.1609/aaai.v40i6.42426

Section

AAAI Technical Track on Computer Vision III