FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-based Pixel-level Data Synthesis

Wenzhuang Wang; Mingcan Ma; Yong Chen; Changqun Xia; Zhenbao Liang; Jia Li

doi:10.1609/aaai.v39i8.32853

Authors

Wenzhuang Wang (State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University; Geely Automobile Research Institute)
Mingcan Ma (Geely Automobile Research Institute)
Yong Chen (Geely Automobile Research Institute)
Changqun Xia (Pengcheng Laboratory)
Zhenbao Liang (Geely Automobile Research Institute)
Jia Li (State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University)

Abstract

Text-to-image diffusion models have inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vanilla attentions often exhibit visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices to achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is presented to assess mask quality based on its variation over uniformly sampled timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that existing segmenters trained on FreeGen data narrow the performance gap with their real-data counterparts and remarkably outperform state-of-the-art methods.
