FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation
DOI:
https://doi.org/10.1609/aaai.v40i33.40064

Abstract
Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks like image editing and virtual try-on. To address this gap, we propose FUSE, a unified framework excelling at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. The integration of these two innovations allows FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on GenEval, 0.65 on WISE, and 3.88 on ImageEdit.
Published
2026-03-14
How to Cite
Zhang, P., He, W., Liu, M., Xiao, W., Zou, S., Li, Y., … Jiang, H. (2026). FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28355–28363. https://doi.org/10.1609/aaai.v40i33.40064
Section
AAAI Technical Track on Machine Learning X