Yamazaki, K., K. Vo, Q. S. Truong, B. Raj, and N. Le. “VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning”. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, June 2023, pp. 3081-90, doi:10.1609/aaai.v37i3.25412.