FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation

Authors

  • Peng Zhang (Zhejiang University; Alibaba Group)
  • Wanggui He (Alibaba Group)
  • Mushui Liu (Zhejiang University; Alibaba Group)
  • Wenyi Xiao (Zhejiang University; Alibaba Group)
  • Siyu Zou (Alibaba Group)
  • Yuan Li (Zhejiang University)
  • Xingjian Wang (Zhejiang University; Alibaba Group)
  • Guanghao Zhang (Alibaba Group)
  • Yanpeng Liu (Alibaba Group)
  • Weilong Dai (Alibaba Group)
  • Jinlong Liu (Alibaba Group)
  • Shuyi Ying (Zhejiang University)
  • Ruikai Zhou (Alibaba Group)
  • Yunlong Yu (Zhejiang University)
  • Yubo Tao (Zhejiang University)
  • Hai Lin (Zhejiang University)
  • Hao Jiang (Alibaba Group)

DOI:

https://doi.org/10.1609/aaai.v40i33.40064

Abstract

Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks such as image editing and virtual try-on. To address this gap, we propose FUSE, a unified framework that excels at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. The integration of these two innovations allows FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on GenEval, 0.65 on WISE, and 3.88 on ImageEdit.
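The connector idea in the abstract can be illustrated in miniature. The sketch below is not the paper's implementation; all names, dimensions, and the simple linear projection are assumptions made for illustration. It only shows the general pattern: fine-grained visual tokens are projected into the same dimensionality as the MLLM's semantic tokens, and the two streams are concatenated into one conditioning sequence for a downstream diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): fine-grained visual
# tokens from a low-level encoder vs. the MLLM's semantic hidden size.
D_DETAIL, D_SEM, N_TOKENS = 64, 128, 16


class SemanticToDetailConnector:
    """Toy stand-in for the paper's connector: a learned projection that
    maps fine-grained visual tokens into the MLLM's semantic space so
    both streams can jointly condition the diffusion process."""

    def __init__(self, d_in: int, d_out: int, seed: int = 0) -> None:
        r = np.random.default_rng(seed)
        # Scaled random init; in practice these weights would be trained
        # during the pre-alignment stage described in the abstract.
        self.W = r.normal(0.0, d_in ** -0.5, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, detail_tokens: np.ndarray) -> np.ndarray:
        # (n_tokens, d_in) -> (n_tokens, d_out)
        return detail_tokens @ self.W + self.b


detail_tokens = rng.normal(size=(N_TOKENS, D_DETAIL))    # low-level features
semantic_tokens = rng.normal(size=(N_TOKENS, D_SEM))     # MLLM encodings

connector = SemanticToDetailConnector(D_DETAIL, D_SEM)
aligned_detail = connector(detail_tokens)

# Unified conditioning sequence: global semantics + aligned local detail.
fused_condition = np.concatenate([semantic_tokens, aligned_detail], axis=0)
print(fused_condition.shape)  # (32, 128)
```

The point of the pattern is that, after projection, detail tokens live in the same space as the semantic tokens, so a single cross-attention interface in the diffusion model can consume both without architectural changes.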

Published

2026-03-14

How to Cite

Zhang, P., He, W., Liu, M., Xiao, W., Zou, S., Li, Y., … Jiang, H. (2026). FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28355–28363. https://doi.org/10.1609/aaai.v40i33.40064

Section

AAAI Technical Track on Machine Learning X