Yu, Fei, et al. “ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, May 2021, pp. 3208-16, doi:10.1609/aaai.v35i4.16431.