Jang, J., Kong, C., Jeon, D., Kim, S., & Kwak, N. (2023). Unifying Vision-Language Representation Space with Single-Tower Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 980–988. https://doi.org/10.1609/aaai.v37i1.25178