Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation

Authors

  • Ziqian Ning, Northwest Polytechnical University; Microsoft
  • Shuai Wang, Shenzhen Research Institute of Big Data
  • Yuepeng Jiang, Northwest Polytechnical University
  • Jixun Yao, Northwest Polytechnical University
  • Lei He, Microsoft
  • Shifeng Pan, Microsoft
  • Jie Ding, Microsoft
  • Lei Xie, Northwest Polytechnical University

DOI:

https://doi.org/10.1609/aaai.v39i23.34680

Abstract

Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, which requires users to have musical knowledge and limits flexibility. In contrast, rap typically features simpler melodies and centers on a strong rhythmic sense that harmonizes with the accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler uses language model-based token generation, followed by a conditional flow matching model that produces spectrograms and a neural vocoder that restores audio. A 3-second prompt enables zero-shot timbre control. Because publicly available rap datasets are scarce, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler generates high-quality rapping voices with enhanced naturalness and strong alignment with the accompanying beats, both stylistically and rhythmically.

Published

2025-04-11

How to Cite

Ning, Z., Wang, S., Jiang, Y., Yao, J., He, L., Pan, S., … Xie, L. (2025). Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24966–24974. https://doi.org/10.1609/aaai.v39i23.34680

Issue

Vol. 39 No. 23

Section

AAAI Technical Track on Natural Language Processing II