[1] H. Shen, T. Zhao, M. Zhu, and J. Yin, “GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection,” AAAI, vol. 38, no. 5, pp. 4766–4775, Mar. 2024.