Yamazaki, Kashu, Khoa Vo, Quang Sang Truong, Bhiksha Raj, and Ngan Le. 2023. “VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning”. Proceedings of the AAAI Conference on Artificial Intelligence 37 (3):3081-90. https://doi.org/10.1609/aaai.v37i3.25412.