DegVoC: Revisiting Neural Vocoder from a Degradation Perspective

Authors

  • Andong Li (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
  • Tong Lei (Tencent AI Lab)
  • Lingling Dai (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
  • Kai Li (Tsinghua University)
  • Rilin Chen (Tencent AI Lab)
  • Meng Yu (Tencent AI Lab)
  • Xiaodong Li (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
  • Dong Yu (Tencent AI Lab)
  • Chengshi Zheng (Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences)

DOI:

https://doi.org/10.1609/aaai.v40i37.40416

Abstract

Existing neural vocoders have demonstrated promising performance by leveraging the Mel-spectrum as an acoustic feature for conditional audio generation. Nonetheless, they remain constrained by an inherent "performance-cost" dilemma that significantly hinders the development of this field. This paper revisits this foundational task from a novel degradation perspective, in which the Mel-spectrum is regarded as the output of a special signal degradation process applied to the target spectrum. Drawing inspiration from traditional sparse signal recovery problems, we propose DegVoC, a GAN-based neural vocoder with a two-step solution procedure. First, by exploiting degradation priors, we retrieve the coarse spectral structure from the Mel-domain representation via a simple linear transformation and use it as an initial solution. Building on that, we introduce a deep prior solver that accounts for the heterogeneous distribution of sub-bands in the time-frequency domain. A convolution-style attention module with a large kernel size is specially devised for efficient inter-frame and inter-band contextual modeling. With 3.89M parameters and substantially reduced inference complexity, DegVoC achieves state-of-the-art performance across objective and subjective evaluations, outperforming existing GAN-, DDPM- and flow-matching-based baselines.
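
The first step described in the abstract can be illustrated with a small sketch. It treats the Mel-spectrum as a linear degradation M = A·S of a magnitude spectrum S by a Mel filterbank A, and recovers an initial estimate with the Moore-Penrose pseudo-inverse of A. The abstract only says "a simple linear transformation", so the choice of the pseudo-inverse is an assumption, as are the triangular filterbank and the hypothetical helper names (`mel_filterbank`, `initial_spectrum`); this is not the authors' implementation.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style Hz -> mel conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank A of shape (n_mels, n_fft // 2 + 1).

    Assumption: a plain triangular bank stands in for the degradation operator.
    """
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def initial_spectrum(mel_spec, fb, clip=True):
    """Linear initial estimate of the magnitude spectrum from a mel-spectrum.

    mel_spec: (n_mels, n_frames); fb: (n_mels, n_bins).
    Assumption: the linear map is the pseudo-inverse of the filterbank.
    """
    est = np.linalg.pinv(fb) @ mel_spec  # (n_bins, n_frames)
    # Magnitudes cannot be negative, so optionally clip the estimate.
    return np.maximum(est, 0.0) if clip else est


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fb = mel_filterbank(n_mels=20, n_fft=256, sr=16000)
    mag = rng.random((fb.shape[1], 50))  # stand-in magnitude spectrogram
    mel = fb @ mag                       # degradation: spectrum -> mel
    init = initial_spectrum(mel, fb)     # coarse initial solution
    print(mel.shape, init.shape)
```

Because the filterbank maps many frequency bins onto far fewer Mel bands, this estimate reproduces the observed Mel-spectrum but cannot restore fine spectral detail, which is the gap the second step (the deep prior solver) is meant to close.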

Published

2026-03-14

How to Cite

Li, A., Lei, T., Dai, L., Li, K., Chen, R., Yu, M., … Zheng, C. (2026). DegVoC: Revisiting Neural Vocoder from a Degradation Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31510–31518. https://doi.org/10.1609/aaai.v40i37.40416

Section

AAAI Technical Track on Natural Language Processing II