DegVoC: Revisiting Neural Vocoder from a Degradation Perspective
DOI:
https://doi.org/10.1609/aaai.v40i37.40416
Abstract
Existing neural vocoders have demonstrated promising performance by leveraging the Mel-spectrum as an acoustic feature for conditional audio generation. Nonetheless, they remain constrained by an inherent "performance-cost" dilemma that significantly hinders the development of this field. This paper revisits this foundational task from a novel degradation perspective, in which the Mel-spectrum is regarded as the result of a special signal degradation process applied to the target spectrum. Drawing inspiration from traditional sparse signal recovery problems, we propose DegVoC, a GAN-based neural vocoder with a two-step solution procedure. First, by exploiting degradation priors, we retrieve the coarse spectral structure from the Mel-domain representation via a simple linear transformation and use it as an initial solution. Second, building on this initial solution, we introduce a deep prior solver that accounts for the heterogeneous distribution of sub-bands in the time-frequency domain. A convolution-style attention module with a large kernel size is specially devised for efficient inter-frame and inter-band contextual modeling. With 3.89 M parameters and substantially reduced inference complexity, DegVoC achieves state-of-the-art performance across objective and subjective evaluations, outperforming existing GAN-, DDPM- and flow-matching-based baselines.
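To make the first step of the abstract concrete, the following is a minimal sketch of how an initial spectrum might be recovered from a Mel-spectrum with a single linear map. The abstract only says "a simple linear transformation"; the use of the Moore-Penrose pseudo-inverse of the Mel filterbank here is an assumption, as are the helper names (`mel_filterbank`, `initial_spectrum`), the triangular filter design, and the settings (80 Mel bands, 1024-point FFT, 22050 Hz). It is not the authors' implementation.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> Mel conversion (an assumed convention).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Hypothetical helper: triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def initial_spectrum(mel_spec, fb):
    """Assumed initial solution: pseudo-inverse of the degradation, clipped to be non-negative."""
    est = np.linalg.pinv(fb) @ mel_spec
    return np.maximum(est, 0.0)

# Toy example: the "degradation" is the Mel projection of a random magnitude spectrum.
rng = np.random.default_rng(0)
fb = mel_filterbank(n_mels=80, n_fft=1024, sr=22050)
target = rng.random((513, 20))       # (frequency bins, frames)
mel = fb @ target                    # degraded observation
init = initial_spectrum(mel, fb)     # coarse estimate a learned solver could refine
```

Because the filterbank maps 513 frequency bins to 80 bands, the linear estimate cannot recover fine spectral detail on its own, which is consistent with the abstract's second step of refining the initial solution with a learned solver.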
Published
2026-03-14
How to Cite
Li, A., Lei, T., Dai, L., Li, K., Chen, R., Yu, M., … Zheng, C. (2026). DegVoC: Revisiting Neural Vocoder from a Degradation Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31510–31518. https://doi.org/10.1609/aaai.v40i37.40416
Section
AAAI Technical Track on Natural Language Processing II