Xu, Ran, Caiming Xiong, Wei Chen, and Jason Corso. “Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework.” Proceedings of the AAAI Conference on Artificial Intelligence 29, no. 1 (February 19, 2015). Accessed April 17, 2024. https://ojs.aaai.org/index.php/AAAI/article/view/9512.