Zheng, S., Zhu, Y., Zhao, H., Yang, F., Zhan, Y., Tang, M., & Wang, J. (2026). GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28857–28865. https://doi.org/10.1609/aaai.v40i34.40120