Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior
DOI:
https://doi.org/10.1609/aaai.v40i11.37833
Abstract
Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still face significant challenges in simultaneously achieving high realism, precise dance-music synchronization, diverse motion expression, and physical plausibility. To address these limitations, we propose a novel approach that leverages a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from diverse guidance signals, including music, genre, and pose, into high-quality dance motion sequences. Our framework also supports semantic motion editing, such as motion inpainting and body part modification. Specifically, we introduce a multi-tower masked motion model that integrates a text-conditioned masked motion backbone with two parallel, modality-specific branches: a music-guidance tower and a pose-guidance tower. The model is trained using synchronized and progressive masked training, which allows effective infusion of the pretrained text-to-motion prior into the dance synthesis process while enabling each guidance branch to optimize independently through its own loss function, mitigating gradient interference. During inference, we introduce classifier-free logits guidance and pose-guided token optimization to strengthen the influence of music, genre, and pose signals. Extensive experiments demonstrate that our method sets a new state of the art in dance generation, significantly advancing the quality and editability over existing approaches.
Published
2026-03-14
How to Cite
Shah, F. N., Shah, P. N., Saleem, M. U., Pinyoanuntapong, E., Wang, P., Xue, H., & Helmy, A. (2026). Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 8796–8804. https://doi.org/10.1609/aaai.v40i11.37833
Issue
Section
AAAI Technical Track on Computer Vision VIII