Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Authors

  • Yanbo Ding (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Shaobin Zhuang (Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University)
  • Kunchang Li (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)
  • Zhengrong Yue (Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University)
  • Yu Qiao (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)
  • Yali Wang (Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory)

DOI:

https://doi.org/10.1609/aaai.v39i3.32280

Abstract

Despite recent advances in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, MUSES, for 3D-controllable image generation from user queries. Specifically, MUSES develops a progressive workflow with three key components: (1) a Layout Manager for 2D-to-3D layout lifting, (2) a Model Engineer for 3D object acquisition and calibration, and (3) an Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline enables the effective and automatic creation of images with 3D-controllable objects through an explainable integration of top-down planning and bottom-up generation. In addition, existing benchmarks lack detailed descriptions of complex 3D spatial relationships among multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results mark a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world.
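The abstract's three-stage agent workflow can be pictured as a simple pipeline in which each stage consumes the previous stage's output. The sketch below is a toy illustration of that structure only; every class, method, and heuristic here is a hypothetical assumption for exposition, not the authors' implementation.

```python
# Toy illustration of the MUSES three-stage pipeline described in the
# abstract. All names and heuristics are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Object3D:
    name: str
    position: tuple       # (x, y, z) placement in scene coordinates
    orientation: float    # yaw angle in degrees


class LayoutManager:
    """Stage 1: lift a 2D layout plan parsed from the prompt into 3D."""
    def plan(self, prompt: str) -> list:
        # Toy heuristic: treat capitalized words as objects and space
        # them along the x-axis at depth z = 0.
        nouns = [w for w in prompt.split() if w.istitle()]
        return [Object3D(n, (float(i), 0.0, 0.0), 0.0)
                for i, n in enumerate(nouns)]


class ModelEngineer:
    """Stage 2: acquire and calibrate a 3D asset per planned object."""
    def acquire(self, layout: list) -> list:
        return [{"object": o, "mesh": f"{o.name.lower()}.obj"}
                for o in layout]


class ImageArtist:
    """Stage 3: render the assembled 3D scene back to a 2D image (stub)."""
    def render(self, assets: list) -> str:
        return f"rendered scene with {len(assets)} objects"


def muses_pipeline(prompt: str) -> str:
    """Chain the three agents: plan -> acquire -> render."""
    layout = LayoutManager().plan(prompt)
    assets = ModelEngineer().acquire(layout)
    return ImageArtist().render(assets)
```

The value of this decomposition, as the abstract argues, is explainability: each hand-off (the 3D layout, the calibrated assets) is an inspectable intermediate artifact rather than a hidden latent state.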

Published

2025-04-11

How to Cite

Ding, Y., Zhuang, S., Li, K., Yue, Z., Qiao, Y., & Wang, Y. (2025). Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2753-2761. https://doi.org/10.1609/aaai.v39i3.32280

Section

AAAI Technical Track on Computer Vision II