Jang, Jiho, Chaerin Kong, DongHyeon Jeon, Seonhoon Kim, and Nojun Kwak. “Unifying Vision-Language Representation Space with Single-Tower Transformer.” Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (June 26, 2023): 980–988. Accessed July 20, 2026. https://ojs.aaai.org/index.php/AAAI/article/view/25178.