Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Authors

  • Yanbo Ding (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Shaobin Zhuang (Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University)
  • Kunchang Li (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)
  • Zhengrong Yue (Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University)
  • Yu Qiao (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)
  • Yali Wang (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)

DOI:

https://doi.org/10.1609/aaai.v39i3.32280

Abstract

Despite recent advances in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, MUSES, for 3D-controllable image generation from user queries. Specifically, MUSES develops a progressive workflow with three key components: (1) a Layout Manager for 2D-to-3D layout lifting, (2) a Model Engineer for 3D object acquisition and calibration, and (3) an Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline enables the effective and automatic creation of images with 3D-controllable objects through an explainable integration of top-down planning and bottom-up generation. In addition, existing benchmarks lack detailed descriptions of complex 3D spatial relationships among multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results mark a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world.
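The abstract's three-stage agent workflow can be pictured as a simple pipeline in which each stage consumes the previous stage's output. The sketch below is a toy illustration of that structure only; every class, method, and heuristic here is a hypothetical assumption for exposition, not the authors' implementation.

```python
# Toy illustration of the MUSES three-stage pipeline described in the
# abstract. All names and heuristics are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Object3D:
    name: str
    position: tuple       # (x, y, z) placement in scene coordinates
    orientation: float    # yaw angle in degrees


class LayoutManager:
    """Stage 1: lift a 2D layout plan parsed from the prompt into 3D."""
    def plan(self, prompt: str) -> list:
        # Toy heuristic: treat capitalized words as objects and space
        # them along the x-axis at depth z = 0.
        nouns = [w for w in prompt.split() if w.istitle()]
        return [Object3D(n, (float(i), 0.0, 0.0), 0.0)
                for i, n in enumerate(nouns)]


class ModelEngineer:
    """Stage 2: acquire and calibrate a 3D asset per planned object."""
    def acquire(self, layout: list) -> list:
        return [{"object": o, "mesh": f"{o.name.lower()}.obj"}
                for o in layout]


class ImageArtist:
    """Stage 3: render the assembled 3D scene back to a 2D image (stub)."""
    def render(self, assets: list) -> str:
        return f"rendered scene with {len(assets)} objects"


def muses_pipeline(prompt: str) -> str:
    """Chain the three agents: plan -> acquire -> render."""
    layout = LayoutManager().plan(prompt)
    assets = ModelEngineer().acquire(layout)
    return ImageArtist().render(assets)
```

The value of this decomposition, as the abstract argues, is explainability: each hand-off (the 3D layout, the calibrated assets) is an inspectable intermediate artifact rather than a hidden latent state.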

Published

2025-04-11

How to Cite

Ding, Y., Zhuang, S., Li, K., Yue, Z., Qiao, Y., & Wang, Y. (2025). Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2753-2761. https://doi.org/10.1609/aaai.v39i3.32280

Section

AAAI Technical Track on Computer Vision II