MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Authors

  • Qian Liang, University of Electronic Science and Technology of China
  • Yujia Wu, University of Electronic Science and Technology of China
  • Kuncheng Li, University of Electronic Science and Technology of China
  • Jiwei Wei, University of Electronic Science and Technology of China
  • Shiyuan He, University of Electronic Science and Technology of China
  • Jinyu Guo, University of Electronic Science and Technology of China
  • Ning Xie, University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i9.37616

Abstract

Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

Published

2026-03-14

How to Cite

Liang, Q., Wu, Y., Li, K., Wei, J., He, S., Guo, J., & Xie, N. (2026). MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 6835–6843. https://doi.org/10.1609/aaai.v40i9.37616

Section

AAAI Technical Track on Computer Vision VI