GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Kaiyi Huang; Yukun Huang; Xuefei Ning; Zinan Lin; Yu Wang; Xihui Liu

doi:10.1609/aaai.v40i7.37418

Authors

Kaiyi Huang University of Hong Kong
Yukun Huang University of Hong Kong
Xuefei Ning Tsinghua University
Zinan Lin Microsoft
Yu Wang Tsinghua University
Xihui Liu University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i7.37418

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Inspired by effective human creative workflow, we propose GENMAC, a multi-agent collaboration framework that enables compositional text-to-video generation. The framework incorporates a three-stage collaborative workflow: DESIGN, GENERATION, and REDESIGN, with an iterative loop between the latter two stages to progressively verify and refine the generated videos. In the DESIGN stage, a large language model (Design Agent) plans objects with layouts, and then a video generation model synthesizes videos in the GENERATION stage. The REDESIGN stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of single-agent and naive multi-agent frameworks, we apply a division-of-labor strategy in this stage by introducing a sequence of specialized agents, executed by MLLMs (multimodal large language models): Verification Agent, Suggestion Agent, Correction Agent, and Output Structuring Agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a suite of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GENMAC by generating videos based on long compositional text prompts and achieving state-of-the-art in the compositional text-to-video generation benchmark.
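The control flow described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`Plan`, `run_workflow`, the callable parameters, the scenario labels "binding" and "consistency", and the stopping rule) is hypothetical, and the agents are replaced by toy stand-ins so the loop can run without any LLM, MLLM, or video model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Plan:
    """Hypothetical container for what the abstract says is (re)designed."""
    prompt: str
    layouts: List[dict]      # frame-wise layouts (toy representation)
    guidance_scale: float


def run_workflow(prompt: str,
                 design: Callable, generate: Callable, verify: Callable,
                 suggest: Callable, route: Callable,
                 correctors: Dict[str, Callable], structure: Callable,
                 max_iters: int = 3) -> Tuple[dict, Plan, List[str]]:
    """DESIGN once, then loop GENERATION <-> REDESIGN (assumed stopping rule)."""
    plan = design(prompt)                  # DESIGN stage
    video = generate(plan)                 # GENERATION stage
    routed: List[str] = []
    for _ in range(max_iters):
        report = verify(prompt, video)     # Verification Agent
        if report["ok"]:
            break
        suggestion = suggest(report)       # Suggestion Agent
        scenario = route(suggestion)       # self-routing picks one corrector
        correction = correctors[scenario](suggestion, plan)  # Correction Agent
        plan = structure(correction, plan) # Output Structuring Agent
        routed.append(scenario)
        video = generate(plan)             # regenerate with the new plan
    return video, plan, routed


def demo() -> Tuple[dict, Plan, List[str]]:
    """Toy stand-ins for the agents; none of this reflects the real models."""
    def design(p):
        return Plan(p, [{"object": "cat", "box": [0.0, 0.0, 0.5, 0.5]}], 5.0)

    def generate(plan):
        return {"prompt": plan.prompt, "scale": plan.guidance_scale}

    def verify(p, video):
        # Pretend the video only matches the prompt once the scale reaches 7.0.
        return {"ok": video["scale"] >= 7.0, "issue": "weak attribute binding"}

    def suggest(report):
        return report["issue"]

    def route(suggestion):
        return "binding" if "binding" in suggestion else "consistency"

    correctors = {
        "binding": lambda s, plan: {"guidance_scale": plan.guidance_scale + 1.0},
        "consistency": lambda s, plan: {"guidance_scale": plan.guidance_scale},
    }

    def structure(correction, plan):
        return Plan(plan.prompt, plan.layouts, correction["guidance_scale"])

    return run_workflow("a red cat beside a blue dog", design, generate,
                        verify, suggest, route, correctors, structure)
```

Calling `demo()` runs two redesign rounds before the toy verifier accepts the result. Passing the agents in as callables keeps the loop itself independent of which models play each role, which is one way to reflect the division of labor the abstract describes.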
