[1] Y. Huang, “Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations”, AAAI, vol. 38, no. 3, pp. 2417–2425, Mar. 2024.