Li, G., N. Duan, Y. Fang, M. Gong, and D. Jiang. “Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training”. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, Apr. 2020, pp. 11336-44, doi:10.1609/aaai.v34i07.6795.