DegVoC: Revisiting Neural Vocoder from a Degradation Perspective

Andong Li; Tong Lei; Lingling Dai; Kai Li; Rilin Chen; Meng Yu; Xiaodong Li; Dong Yu; Chengshi Zheng

doi:10.1609/aaai.v40i37.40416

Authors

Andong Li (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
Tong Lei (Tencent AI Lab)
Lingling Dai (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
Kai Li (Tsinghua University)
Rilin Chen (Tencent AI Lab)
Meng Yu (Tencent AI Lab)
Xiaodong Li (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
Dong Yu (Tencent AI Lab)
Chengshi Zheng (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)

DOI: https://doi.org/10.1609/aaai.v40i37.40416

Abstract

Existing neural vocoders have demonstrated promising performance by leveraging the Mel-spectrum as an acoustic feature for conditional audio generation. Nonetheless, they remain constrained by an inherent "performance-cost" dilemma that significantly hinders the development of this field. This paper revisits this foundational task from a novel degradation perspective, in which the Mel-spectrum is regarded as the output of a special signal degradation process applied to the target spectrum. Drawing inspiration from traditional sparse signal recovery problems, we propose DegVoC, a GAN-based neural vocoder with a two-step solution procedure. First, by exploiting degradation priors, we retrieve the spectral structure from the Mel-domain representation via a simple linear transformation, which serves as an initial solution. Building on this, we introduce a deep prior solver that accounts for the heterogeneous distribution of sub-bands in the time-frequency domain. A convolution-style attention module with a large kernel size is specially devised for efficient inter-frame and inter-band contextual modeling. With 3.89 M parameters and substantially reduced inference complexity, DegVoC achieves state-of-the-art performance across objective and subjective evaluations, outperforming existing GAN-, DDPM- and flow-matching-based baselines.
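
The degradation view in the abstract can be illustrated with a small NumPy sketch. It is not the DegVoC implementation. The abstract only says the initial solution comes from "a simple linear transformation", so the Moore-Penrose pseudo-inverse used here is an assumption, as are the HTK-style triangular filterbank, the helper names (`mel_filterbank`, `hz_to_mel`, `mel_to_hz`), and all sizes. The sketch treats the Mel-spectrum Y as a linear degradation Y = A X of a magnitude spectrum X and maps it back to the linear-frequency domain.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_freq, sr):
    """Hypothetical helper: triangular filterbank of shape (n_mels, n_freq)."""
    freqs = np.linspace(0.0, sr / 2.0, n_freq)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_freq))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

rng = np.random.default_rng(0)
n_freq, n_mels, n_frames, sr = 513, 40, 10, 16000  # illustrative sizes only

A = mel_filterbank(n_mels, n_freq, sr)   # degradation operator
X = rng.random((n_freq, n_frames))       # stand-in for a target magnitude spectrum
Y = A @ X                                # degraded observation (Mel-spectrum)

# Assumed linear initial solution: minimum-norm estimate via the pseudo-inverse.
A_pinv = np.linalg.pinv(A)
X0 = A_pinv @ Y                          # back in the linear-frequency domain
```

The estimate `X0` reproduces the observed Mel-spectrum (`A @ X0` matches `Y`) but differs from `X`, because 513 frequency bins cannot be recovered from 40 Mel bands by a linear map alone. In the abstract's framing, that gap is what the second step, the deep prior solver, has to close.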
