GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Authors

  • Kaiyi Huang University of Hong Kong
  • Yukun Huang University of Hong Kong
  • Xuefei Ning Tsinghua University
  • Zinan Lin Microsoft
  • Yu Wang Tsinghua University
  • Xihui Liu University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i7.37418

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Inspired by effective human creative workflows, we propose GENMAC, a multi-agent collaboration framework that enables compositional text-to-video generation. The framework incorporates a three-stage collaborative workflow: DESIGN, GENERATION, and REDESIGN, with an iterative loop between the latter two stages to progressively verify and refine the generated videos. In the DESIGN stage, a large language model (Design Agent) plans objects with layouts, and then a video generation model synthesizes videos in the GENERATION stage. The REDESIGN stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid the hallucinations of single-agent and naive multi-agent frameworks, we apply a division-of-labor strategy in this stage by introducing a sequence of specialized agents, executed by MLLMs (multimodal large language models): Verification Agent, Suggestion Agent, Correction Agent, and Output Structuring Agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a suite of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GENMAC by generating videos based on long compositional text prompts and achieving state-of-the-art results on the compositional text-to-video generation benchmark.
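The workflow described in the abstract can be sketched as a control loop. This is a minimal, heavily stubbed illustration based only on the abstract: every function body, agent name binding, scenario key, and parameter here is a hypothetical placeholder, not the paper's actual implementation.

```python
# Hedged sketch of GENMAC's DESIGN -> GENERATION -> REDESIGN loop.
# All stubs and scenario keys below are illustrative assumptions.

def design_agent(prompt):
    # DESIGN: an LLM plans objects with frame-wise layouts (stubbed).
    return {"prompt": prompt, "layouts": [], "guidance_scale": 7.5, "revised": False}

def generation_agent(plan):
    # GENERATION: a video model synthesizes a video from the plan (stubbed).
    return {"frames": f"video for: {plan['prompt']}", "plan": plan}

def verification_agent(video):
    # REDESIGN step 1: an MLLM verifies the video against the prompt.
    # Stub: the first attempt "fails", any revised plan "passes".
    return video["plan"]["revised"]

def suggestion_agent(video):
    # REDESIGN step 2: an MLLM names the failure scenario (stubbed).
    return "attribute_binding"

# Self-routing: a suite of correction agents, one per compositional scenario.
CORRECTION_AGENTS = {
    "attribute_binding": lambda p: {**p, "guidance_scale": p["guidance_scale"] + 1.0,
                                    "revised": True},
    "temporal_dynamics": lambda p: {**p, "layouts": p["layouts"] + ["keyframe"],
                                    "revised": True},
}

def output_structuring_agent(plan):
    # REDESIGN step 4: emit a well-formed plan for the next generation round.
    return dict(plan)

def genmac(prompt, max_iters=3):
    plan = design_agent(prompt)                       # DESIGN (runs once)
    for i in range(max_iters):
        video = generation_agent(plan)                # GENERATION
        if verification_agent(video):                 # REDESIGN: verify
            return video, i + 1
        scenario = suggestion_agent(video)            # REDESIGN: suggest
        plan = CORRECTION_AGENTS[scenario](plan)      # REDESIGN: route + correct
        plan = output_structuring_agent(plan)         # REDESIGN: structure output
    return video, max_iters
```

The loop body mirrors the division-of-labor sequence from the abstract: verification, suggestion, routed correction, and output structuring, iterating GENERATION and REDESIGN until the video passes verification or a budget is exhausted.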

Published

2026-03-14

How to Cite

Huang, K., Huang, Y., Ning, X., Lin, Z., Wang, Y., & Liu, X. (2026). GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5049-5057. https://doi.org/10.1609/aaai.v40i7.37418

Section

AAAI Technical Track on Computer Vision IV