[1]
Yamazaki, K., Vo, K., Truong, Q.S., Raj, B. and Le, N. 2023. VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning. Proceedings of the AAAI Conference on Artificial Intelligence. 37, 3 (Jun. 2023), 3081-3090. DOI:https://doi.org/10.1609/aaai.v37i3.25412.