[1] F. Yu, “ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs”, AAAI, vol. 35, no. 4, pp. 3208–3216, May 2021.