[1] Z. Sun, P. Sarma, W. Sethares, and Y. Liang, “Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis,” AAAI, vol. 34, no. 05, pp. 8992–8999, Apr. 2020.