Shamsolmoali, P., Zareapoor, M., Granger, E., & Felsberg, M. (2024). SeTformer Is What You Need for Vision and Language. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4713–4721. https://doi.org/10.1609/aaai.v38i5.28272