MMG-VL: A Vision-Language Driven Approach for Multi-Person Motion Generation

Authors

  • Songyuan Yang National University of Defense Technology
  • Wanrong Huang National University of Defense Technology
  • Yinuo Liu National University of Defense Technology
  • Zhang Ke-Di National University of Defense Technology
  • Xihuai He National University of Defense Technology
  • Shaowu Yang National University of Defense Technology
  • Huibin Tan National University of Defense Technology

DOI:

https://doi.org/10.1609/aaai.v40i14.38156

Abstract

Generating realistic and coordinated 3D human motion for multiple individuals within complex environments remains a significant challenge. Existing text-to-motion methods are often "blind" to the physical scene, leading to implausible motions, while scene-conditioned (HSI) approaches demand cumbersome full 3D data and largely neglect multi-person dynamics. To address these limitations, we introduce the VL2Motion paradigm and its embodiment, MMG-VL, a hierarchical framework that generates coordinated multi-person motions from the most accessible inputs: a single 2D image and natural language. MMG-VL first employs a Scene-Aware Intent Planner (SAIP) to interpret the visual context and decompose the user's command into a set of spatially grounded, multi-person action blueprints. Subsequently, a Coordinated Motion Synthesizer (CMS) translates these blueprints into high-fidelity 3D motion sequences. The synergy between these stages is driven by two novel loss functions: a Spatial-Semantic Grounding Loss that ensures the planner's output is grounded in visual reality, and a Coordinated Environmental Realism Loss that enforces physical constraints and coherent group dynamics during synthesis. To facilitate this research, we introduce HumanVL, the first large-scale dataset featuring multi-person activities in multi-room scenes, providing aligned images, text, blueprints, 3D motions, and scene geometry. Extensive experiments demonstrate that MMG-VL significantly outperforms existing methods in generating spatially coherent, physically realistic, and coordinated multi-person motions, paving the way for more scalable and intuitive creation of dynamic virtual worlds.

Published

2026-03-14

How to Cite

Yang, S., Huang, W., Liu, Y., Ke-Di, Z., He, X., Yang, S., & Tan, H. (2026). MMG-VL: A Vision-Language Driven Approach for Multi-Person Motion Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11712-11720. https://doi.org/10.1609/aaai.v40i14.38156

Section

AAAI Technical Track on Computer Vision XI