MMG-VL: A Vision-Language Driven Approach for Multi-Person Motion Generation

Authors

  • Songyuan Yang National University of Defense Technology
  • Wanrong Huang National University of Defense Technology
  • Yinuo Liu National University of Defense Technology
  • Zhang Ke-Di National University of Defense Technology
  • Xihuai He National University of Defense Technology
  • Shaowu Yang National University of Defense Technology
  • Huibin Tan National University of Defense Technology

DOI:

https://doi.org/10.1609/aaai.v40i14.38156

Abstract

Generating realistic and coordinated 3D human motion for multiple individuals within complex environments remains a significant challenge. Existing text-to-motion methods are often "blind" to the physical scene, leading to implausible motions, while scene-conditioned (HSI) approaches demand cumbersome full 3D data and largely neglect multi-person dynamics. To address these limitations, we introduce the VL2Motion paradigm and its embodiment, MMG-VL, a hierarchical framework that generates coordinated multi-person motions from the most accessible inputs: a single 2D image and natural language. MMG-VL first employs a Scene-Aware Intent Planner (SAIP) to interpret the visual context and decompose the user's command into a set of spatially grounded, multi-person action blueprints. Subsequently, a Coordinated Motion Synthesizer (CMS) translates these blueprints into high-fidelity 3D motion sequences. The synergy between these stages is driven by two novel loss functions: a Spatial-Semantic Grounding Loss that ensures the planner's output is grounded in visual reality, and a Coordinated Environmental Realism Loss that enforces physical constraints and coherent group dynamics during synthesis. To facilitate this research, we introduce HumanVL, the first large-scale dataset featuring multi-person activities in multi-room scenes, providing aligned images, text, blueprints, 3D motions, and scene geometry. Extensive experiments demonstrate that MMG-VL significantly outperforms existing methods in generating spatially coherent, physically realistic, and coordinated multi-person motions, paving the way for more scalable and intuitive creation of dynamic virtual worlds.

Published

2026-03-14

How to Cite

Yang, S., Huang, W., Liu, Y., Ke-Di, Z., He, X., Yang, S., & Tan, H. (2026). MMG-VL: A Vision-Language Driven Approach for Multi-Person Motion Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11712-11720. https://doi.org/10.1609/aaai.v40i14.38156

Section

AAAI Technical Track on Computer Vision XI