Xu, Ran, Caiming Xiong, Wei Chen, and Jason Corso. 2015. “Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework.” Proceedings of the AAAI Conference on Artificial Intelligence 29 (1). https://doi.org/10.1609/aaai.v29i1.9512.