Yu, J., Li, S., Han, M., Yin, Y., Song, W., Jia, C., & Lan, M. (2026). Activating Visual Context and Commonsense Reasoning Through Masked Prediction in VLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 27952–27960. https://doi.org/10.1609/aaai.v40i33.40019