Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., & Wang, H. (2021). ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4), 3208-3216. https://doi.org/10.1609/aaai.v35i4.16431