Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

Authors

  • Rui Zhao, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
  • Liang Zhang, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
  • Biao Fu, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
  • Cong Hu, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
  • Jinsong Su, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
  • Yidong Chen, School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China

DOI:

https://doi.org/10.1609/aaai.v38i17.29937

Keywords:

NLP: Language Grounding & Multi-modal NLP, NLP: Machine Translation, Multilinguality, Cross-Lingual NLP

Abstract

Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes cross-modal alignment between the visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model relies solely on visual information to predict the target text, whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs a self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which treats the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
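The abstract describes a two-path design: a prior path that decodes from video alone and a posterior path that additionally conditions on the target text, tied together by a latent-space KL term, an output-space self-distillation KL term, and a posterior modeled as a residual Gaussian over the prior (ARGD). The following is a minimal, illustrative PyTorch-style sketch of that structure, not the authors' implementation; every module name, dimension, and the mean-pooling and single-layer decoding choices are assumptions made purely for illustration (see the released code at https://github.com/rzhao-zhsq/CV-SLT for the actual method).

# Illustrative sketch only: a two-path conditional VAE with a residual
# Gaussian posterior and two KL terms, loosely following the abstract.
# Names, dimensions, and pooling choices are assumptions, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the latent dim,
    # averaged over the batch.
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    ).mean()


class CVSLTSketch(nn.Module):
    def __init__(self, d_model=512, d_latent=64, vocab_size=10000):
        super().__init__()
        self.visual_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Prior over z conditioned on video only.
        self.prior_head = nn.Linear(d_model, 2 * d_latent)
        # Text-conditioned *residual* over the prior (stand-in for ARGD).
        self.residual_head = nn.Linear(2 * d_model, 2 * d_latent)
        # Toy decoder: per-frame logits from visual states plus the latent.
        self.decoder = nn.Linear(d_model + d_latent, vocab_size)

    def decode(self, h_v, z):
        z_seq = z.unsqueeze(1).expand(-1, h_v.size(1), -1)
        return self.decoder(torch.cat([h_v, z_seq], dim=-1))

    def forward(self, video_feats, text_feats):
        h_v, _ = self.visual_encoder(video_feats)   # (B, T, d_model)
        v_pool = h_v.mean(dim=1)

        # Prior path: video only.
        mu_p, logvar_p = self.prior_head(v_pool).chunk(2, dim=-1)

        # Posterior path: video + text, expressed as a residual over the prior.
        h_t, _ = self.text_encoder(text_feats)
        t_pool = h_t.mean(dim=1)
        d_mu, d_logvar = self.residual_head(
            torch.cat([v_pool, t_pool], dim=-1)).chunk(2, dim=-1)
        mu_q, logvar_q = mu_p + d_mu, logvar_p + d_logvar

        # Reparameterized samples and one decode per path.
        z_q = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        z_p = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
        logits_post = self.decode(h_v, z_q)
        logits_prior = self.decode(h_v, z_p)

        # KL 1: regularize the encoder side (posterior latent vs. prior latent).
        kl_latent = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

        # KL 2: self-distill the posterior decoder outputs into the prior path.
        kl_output = F.kl_div(
            F.log_softmax(logits_prior, dim=-1),
            F.softmax(logits_post.detach(), dim=-1),
            reduction="batchmean",
        )
        return logits_post, kl_latent, kl_output

In this sketch, kl_latent plays the role of the first KL term (posterior vs. prior latent Gaussians) and kl_output distills the posterior decoder distribution into the prior path, so that inference from video alone stays consistent with the text-informed training path.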

Published

2024-03-24

How to Cite

Zhao, R., Zhang, L., Fu, B., Hu, C., Su, J., & Chen, Y. (2024). Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19643-19651. https://doi.org/10.1609/aaai.v38i17.29937

Issue

Vol. 38 No. 17 (2024)

Section

AAAI Technical Track on Natural Language Processing II