UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Authors

  • Jinting Wang The Hong Kong University of Science and Technology (Guangzhou)
  • Shan Yang Tencent AI Lab
  • Chenxing Li Tencent AI Lab
  • Dong Yu Tencent AI Lab
  • Li Liu The Hong Kong University of Science and Technology (Guangzhou)

DOI:

https://doi.org/10.1609/aaai.v40i39.40643

Abstract

Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide the speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual–semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11,282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments conducted on this dataset demonstrate that UniCUE achieves state-of-the-art (SOTA) performance across multiple evaluation metrics.

Published

2026-03-14

How to Cite

Wang, J., Yang, S., Li, C., Yu, D., & Liu, L. (2026). UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33548–33556. https://doi.org/10.1609/aaai.v40i39.40643

Section

AAAI Technical Track on Natural Language Processing IV