Li, Gen, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. “Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training.” Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11336–11344. Accessed April 24, 2024. https://ojs.aaai.org/index.php/AAAI/article/view/6795.