FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation

Authors

  • Peng Zhang (Zhejiang University; Alibaba Group)
  • Wanggui He (Alibaba Group)
  • Mushui Liu (Zhejiang University; Alibaba Group)
  • Wenyi Xiao (Zhejiang University; Alibaba Group)
  • Siyu Zou (Alibaba Group)
  • Yuan Li (Zhejiang University)
  • Xingjian Wang (Zhejiang University; Alibaba Group)
  • Guanghao Zhang (Alibaba Group)
  • Yanpeng Liu (Alibaba Group)
  • Weilong Dai (Alibaba Group)
  • Jinlong Liu (Alibaba Group)
  • Shuyi Ying (Zhejiang University)
  • Ruikai Zhou (Alibaba Group)
  • Yunlong Yu (Zhejiang University)
  • Yubo Tao (Zhejiang University)
  • Hai Lin (Zhejiang University)
  • Hao Jiang (Alibaba Group)

DOI:

https://doi.org/10.1609/aaai.v40i33.40064

Abstract

Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks such as image editing and virtual try-on. To address this gap, we propose FUSE, a unified framework that excels at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. The integration of these two innovations allows FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on GenEval, 0.65 on WISE, and 3.88 on ImageEdit.
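The connector idea in the abstract can be illustrated in miniature. The sketch below is not the paper's implementation; all names, dimensions, and the simple linear projection are assumptions made for illustration. It only shows the general pattern: fine-grained visual tokens are projected into the same dimensionality as the MLLM's semantic tokens, and the two streams are concatenated into one conditioning sequence for a downstream diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): fine-grained visual
# tokens from a low-level encoder vs. the MLLM's semantic hidden size.
D_DETAIL, D_SEM, N_TOKENS = 64, 128, 16


class SemanticToDetailConnector:
    """Toy stand-in for the paper's connector: a learned projection that
    maps fine-grained visual tokens into the MLLM's semantic space so
    both streams can jointly condition the diffusion process."""

    def __init__(self, d_in: int, d_out: int, seed: int = 0) -> None:
        r = np.random.default_rng(seed)
        # Scaled random init; in practice these weights would be trained
        # during the pre-alignment stage described in the abstract.
        self.W = r.normal(0.0, d_in ** -0.5, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, detail_tokens: np.ndarray) -> np.ndarray:
        # (n_tokens, d_in) -> (n_tokens, d_out)
        return detail_tokens @ self.W + self.b


detail_tokens = rng.normal(size=(N_TOKENS, D_DETAIL))    # low-level features
semantic_tokens = rng.normal(size=(N_TOKENS, D_SEM))     # MLLM encodings

connector = SemanticToDetailConnector(D_DETAIL, D_SEM)
aligned_detail = connector(detail_tokens)

# Unified conditioning sequence: global semantics + aligned local detail.
fused_condition = np.concatenate([semantic_tokens, aligned_detail], axis=0)
print(fused_condition.shape)  # (32, 128)
```

The point of the pattern is that, after projection, detail tokens live in the same space as the semantic tokens, so a single cross-attention interface in the diffusion model can consume both without architectural changes.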

Published

2026-03-14

How to Cite

Zhang, P., He, W., Liu, M., Xiao, W., Zou, S., Li, Y., … Jiang, H. (2026). FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28355–28363. https://doi.org/10.1609/aaai.v40i33.40064

Section

AAAI Technical Track on Machine Learning X