[1] Zheng, S. et al. 2026. GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence. 40, 34 (Mar. 2026), 28857–28865. DOI:https://doi.org/10.1609/aaai.v40i34.40120.