Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation

Ziqian Ning; Shuai Wang; Yuepeng Jiang; Jixun Yao; Lei He; Shifeng Pan; Jie Ding; Lei Xie

doi:10.1609/aaai.v39i23.34680

Authors

Ziqian Ning, Northwest Polytechnical University; Microsoft
Shuai Wang, Shenzhen Research Institute of Big Data
Yuepeng Jiang, Northwest Polytechnical University
Jixun Yao, Northwest Polytechnical University
Lei He, Microsoft
Shifeng Pan, Microsoft
Jie Ding, Microsoft
Lei Xie, Northwest Polytechnical University

DOI:

https://doi.org/10.1609/aaai.v39i23.34680

Abstract

Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, which requires users to have musical knowledge and limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler uses language model-based token generation, followed by a conditional flow matching model that produces spectrograms and a neural vocoder that restores audio. A 3-second prompt is enough to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler generates high-quality rapping voices with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.
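
The three stages named in the abstract (token generation, spectrogram generation, vocoding) can be pictured as a simple chain. The sketch below is only an illustration of that data flow and is not the authors' implementation: every function name, constant, and shape (`generate_tokens`, `tokens_to_mel`, `vocode`, `N_MELS`, `HOP`, `VOCAB`) is a hypothetical placeholder, and each stage body is a toy stand-in rather than a language model, a conditional flow matching model, or a neural vocoder.

```python
import random
from typing import List

# Hypothetical constants; none of these values come from the paper.
N_MELS = 80    # assumed number of mel-spectrogram channels
HOP = 256      # assumed audio samples per spectrogram frame
VOCAB = 1024   # assumed size of the discrete token vocabulary


def generate_tokens(lyrics: str, accompaniment: List[float], seed: int = 0) -> List[int]:
    """Stage 1 placeholder: one discrete token per accompaniment frame.

    A real system would run a language model conditioned on the lyrics and
    accompaniment; here the tokens are just seeded pseudo-random ids.
    """
    rng = random.Random(seed)
    n_frames = len(accompaniment) // HOP
    return [rng.randrange(VOCAB) for _ in range(n_frames)]


def tokens_to_mel(tokens: List[int], prompt_mel: List[List[float]]) -> List[List[float]]:
    """Stage 2 placeholder: map tokens to mel frames, biased by a voice prompt.

    The per-channel mean of the prompt stands in for timbre conditioning.
    """
    n = len(prompt_mel)
    bias = [sum(frame[c] for frame in prompt_mel) / n for c in range(N_MELS)]
    return [[tok / VOCAB + bias[c] for c in range(N_MELS)] for tok in tokens]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stage 3 placeholder: expand each mel frame into HOP audio samples."""
    audio: List[float] = []
    for frame in mel:
        audio.extend([sum(frame) / N_MELS] * HOP)
    return audio


def rap(lyrics: str, accompaniment: List[float], prompt_mel: List[List[float]]) -> List[float]:
    """Chain the stages: lyrics + accompaniment -> tokens -> mel -> audio."""
    tokens = generate_tokens(lyrics, accompaniment)
    mel = tokens_to_mel(tokens, prompt_mel)
    return vocode(mel)
```

Calling `rap("some lyrics", accompaniment, prompt_mel)` with an accompaniment of `10 * HOP` samples returns a list of `10 * HOP` output samples, so the generated vocal stays frame-aligned with the accompaniment in this toy setup.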
