[1] S. Zheng, “GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models”, AAAI, vol. 40, no. 34, pp. 28857–28865, Mar. 2026.